High-Performance Low-Power Cache Memory
Architectures
Koji Inoue
Kyushu University
January 2001
Abstract
Recent remarkable advances in VLSI technology have been increasing processor speed and
DRAM capacity. However, these advances have also introduced a large and growing performance
gap between the processor and main memory. Cache memories have long been employed on
processor chips in order to bridge this processor-memory performance gap, and researchers
have made great efforts to improve cache performance.
However, the environment surrounding processor-chip design has been changing. 1) The rapidly
growing mobile market strongly demands not only high performance but also low energy dissipation
in order to extend battery life. 2) Recent VLSI technology has made it possible to integrate the
processor and main memory on the same chip, so that the chip boundary between the cache
and main memory can be eliminated. These changes suggest that we need to keep reconsidering
cache architectures for high-performance, low-energy computer systems.
Reducing the frequency of off-chip accesses has two main advantages: it reduces memory-access
latency and it reduces the energy dissipated for driving external I/O pins. The most
straightforward way to improve the performance/energy efficiency of memory systems is to
invest the increasing transistor budget in the cache memories (i.e., to increase cache capacity).
Increasing cache capacity improves cache-hit rates, so that more memory accesses can be
confined on-chip. However, it also leads to increases in cache-access latency, the time required
to access the cache, and cache-access energy, the energy dissipated per cache access. Since
almost all memory accesses are served by cache memories, improving the
performance/energy efficiency of cache memories is one of the most important challenges.
This thesis introduces adaptive cache-management techniques for high-performance, low-energy
processor chips. The caches proposed in this thesis attempt to eliminate unnecessary
operations in order to reduce energy dissipation and improve performance.
In the first part of this thesis, we introduce a cache architecture for reducing cache-access
energy, called the way-predicting set-associative cache. In conventional set-associative caches,
all ways are searched in parallel because the cache-access time is critical. In fact, on a
cache hit, only one way holds the data desired by the processor, so accessing the
remaining ways is unnecessary. The way-predicting set-associative cache attempts to avoid
this unnecessary way activation, and has the following features:
• The cache has a way-prediction table. Each entry in the table is used for speculative
way selection.
• Before a cache access is started, the way-prediction table is accessed to get a hint for
the speculative way selection.
• Only the predicted way is activated (searched).
• If the way-prediction is correct, the cache access is completed in one cycle. As the
remaining ways are not activated, the energy dissipation for the cache access can be
reduced.
• If the cache makes a wrong way prediction, the remaining ways are searched in
the same manner as in conventional set-associative caches. In this case, the cache achieves
no energy reduction; in addition, it wastes one extra cycle to access the remaining ways.
We evaluate the performance/energy efficiency of way-predicting set-associative caches. The
way-predicting scheme reduces cache-access energy by more than 70 %, while incurring less
than 10 % cache-access-time overhead. In addition, we evaluate the effects of hardware
constraints on the way-predicting set-associative cache, and conclude that our scheme is
promising for future processor chips that employ large on-chip caches.
In the second part of this thesis, we introduce a cache architecture for reducing cache-access
energy, called the history-based tag-comparison cache. In conventional caches, tag comparison is
performed on every access in order to determine whether the access hits the cache. The content
of the cache is updated only when a cache miss takes place. Therefore, if an instruction block was
executed before, and if there have been no cache misses since the previous execution of the
instruction block, then it is guaranteed that the instruction block is currently cache resident.
In this case, we do not need to perform tag comparison, so the energy dissipated for
performing tag comparison can be completely eliminated. The history-based tag-comparison
cache attempts to avoid unnecessary tag comparisons, and has the following features:
• Execution footprints are recorded in an extended BTB (Branch Target Buffer).
• When a branch is executed, a corresponding footprint is recorded. The footprint denotes
that the target instruction block of the branch is currently cache resident.
• When the branch is executed again, the corresponding footprint is checked. If the cache
detects the recorded footprint, then tag comparisons for cache accesses to the target
instruction block are omitted.
• When a cache miss takes place, all execution footprints are erased.
• Since the hardware components for the history-based tag-comparison cache do not appear
on cache critical paths, the cache-access time of the conventional organization is maintained.
We evaluate the energy efficiency of history-based tag-comparison caches. In the best case, a
history-based tag-comparison cache reduces tag-comparison energy by 99 % for the execution
of a program. Since the tag-omission scheme relies on the loop structure of programs,
our cache works well for floating-point and media programs, which have relatively
well-structured loops. Although our cache does not significantly reduce tag-comparison
energy for some integer programs, increasing the cache capacity improves the effectiveness
of the tag-omission scheme. Therefore, we conclude that the history-based tag-comparison cache
is promising for future processor chips that employ large on-chip caches.
In the last part of this thesis, we introduce a high-performance, low-energy technique for on-
chip memory systems, called the dynamically variable line-size cache. For merged DRAM/logic
LSIs with a memory hierarchy including cache memory, we can exploit high on-chip memory
bandwidth by replacing a whole cache line at a time on cache misses. This approach
tends to increase the cache-line size if we attempt to improve the attainable memory
bandwidth. Although larger cache lines give a prefetching effect, they may worsen cache-hit
rates if programs do not have enough spatial locality. The dynamically variable line-size
cache attempts to avoid the unnecessary data replacements caused by large cache lines
by adjusting the cache-line size according to the degree of spatial locality. The cache has the
following features:
• A large cache line is partitioned into small cache lines (sublines).
• When rich spatial locality is observed, a large number of sublines are involved in cache
replacements (assembling a large cache line). In contrast, when poor spatial locality is
observed, a small number of sublines are involved (assembling a small cache line).
• Since conflict misses are reduced by decreasing the cache-line size rather than by increasing
the associativity, high-speed cache access can be maintained.
• Data transfer between the cache and main memory can be completed in a constant
time regardless of the cache-line sizes because of the high on-chip memory bandwidth
on merged DRAM/logic LSIs.
• Only the DRAM subarrays corresponding to the sublines to be replaced are activated,
thereby saving the main-memory-access energy for cache replacements.
We evaluate the performance/energy efficiency of dynamically variable line-size caches hav-
ing 32-byte, 64-byte, and 128-byte cache-line sizes. For a benchmark set which consists of
two integer programs and one floating-point program, a dynamically variable line-size cache
reduces the average memory-access time by 20 % and the average memory-access energy by
35 %, compared with a conventional cache having a fixed 128-byte cache-line size. In
addition, we investigate the effects of on-chip DRAM characteristics, which depend strongly
on device technology, and observe that the dynamically variable line-size cache achieves
significant performance/energy improvements over a wide range of on-chip DRAM access
speeds and energies. Therefore, we conclude that the dynamically variable line-size cache is
promising for future processor chips using merged DRAM/logic LSIs.
When we consider portable computing in worldwide network systems, software portability is
also required. Our caches monitor the behavior of memory references and attempt to avoid
unnecessary operations at run time. Since the caches do not require any modification of the
instruction-set architecture, full compatibility with existing object code is maintained.
Contents
Abstract i
Contents v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Memory Systems Employing Cache Memories 9
2.1 Principle of Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Memory-Access Time and Energy Definitions . . . . . . . . . . . . . . . . . . . 10
2.3 Conventional Cache Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 High-Speed Memory-Access Techniques . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Making Cache Access Faster . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.2 Making Cache-Miss Rate Lower . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Making Cache-Miss Penalty Smaller . . . . . . . . . . . . . . . . . . . . 24
2.5 Low-Energy Memory-Access Techniques . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Reducing Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . 25
2.5.2 Reducing Data-Transfer/Main-Memory-Access Energy . . . . . . . . . 30
2.5.3 Reducing DRAM-Static Energy . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Way-Predicting Set-Associative Cache Architecture 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Conventional Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.2 Phased Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Way-Predicting Set-Associative Cache . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Way Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Evaluations: Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Evaluations: Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Way-Prediction Hit Rates . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.3 Cache-Access Time and Energy . . . . . . . . . . . . . . . . . . . . . . 46
3.5.4 Energy-Delay Product . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.5 Performance/Energy Overhead . . . . . . . . . . . . . . . . . . . . . . 51
3.5.6 Effects of Other Parameters . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4 History-Based Tag-Comparison Cache Architecture 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Breakdown of Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Interline Tag-Comparison Cache . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 History-Based Tag-Comparison Cache . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.3 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.2 Energy Reduction for Tag Comparisons . . . . . . . . . . . . . . . . . . 73
4.5.3 Effects of Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . 76
4.5.4 Energy Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 Variable Line-Size Cache Architecture 83
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Conventional Approaches to Exploiting High Memory-Bandwidth . . . . . . . 85
5.3 Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Concept and Principle of Operations . . . . . . . . . . . . . . . . . . . 88
5.3.3 Line Size Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Statically Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.3 Line-Size Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5 Dynamically Variable Line-Size Cache . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.1 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.2 Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.5.3 Line-Size Determination . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6.2 Cache-Access Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.3 Cache-Access Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.6.4 Cache-Miss Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6.5 Main-Memory-Access Time and Energy . . . . . . . . . . . . . . . . . . 106
5.6.6 Average Memory-Access Time . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.7 Average Memory-Access Energy . . . . . . . . . . . . . . . . . . . . . . 110
5.6.8 Energy–Delay Product . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.9 Hardware Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6.10 Effects of Other Parameters . . . . . . . . . . . . . . . . . . . . . . . . 116
5.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 Conclusions 125
Acknowledgment 129
Bibliography 131
List of Publications by the Author 143
Chapter 1
Introduction
1.1 Motivation
VLSI technologies have been increasing processor speed and DRAM capacity dramatically.
For example, implementations of 1 GHz processors and 1 Gbit DRAMs have been reported
[24], [32], [10], [55]. However, these advances have also introduced a large and growing performance
gap between processors and main memory (DRAM). By improving not only the clock speed but
also the instruction-level parallelism (ILP), processor performance has been improving at a
rate of 60 % per year. On the other hand, the access time of DRAM has been improving at
a rate of less than 10 % per year [72]. Moreover, current memory systems suffer from a lack
of memory bandwidth caused by the I/O-pin bottleneck. This problem is known as the “Memory
Wall” [12], [97]. The inadequacy of memory systems causes poor total system performance in
spite of higher processor performance.
Cache memories have been playing an important role in bridging the performance gap between
high-speed processors and low-speed off-chip main memory, because confining memory
accesses on-chip reduces memory-access latency. Much research has focused on improving
cache performance, and many high-performance cache architectures have been proposed.
However, the processor–memory performance gap is still growing. Patterson et al. [72] ana-
lyzed the breakdown of execution time (ET ) for benchmark programs as shown in Figure 1.1.
The memory hierarchy in the Alpha system includes up to level-3 caches. In their results,
it can be observed that the database and matrix-computation programs spend about 75 % of
their time in memory accesses.
Figure 1.1: Fraction of Time Spent in Each Component on the Alpha 21164. (Bar chart showing, for the SPECint92, SPECfp92, DataBase, and Sparse benchmark programs, the percentage of execution time spent in the processor and in I-cache, D-cache, L2-cache, and L3-cache misses.)
In this case, a 20 % improvement in memory performance yields a 15 % reduction in ET; in
other words, a 20 % degradation in memory-system performance worsens the total system
performance by 15 %. Of course, the effect of memory-system performance on ET depends on
the characteristics of the target programs, for example, the total number of load/store
instructions executed, the instruction issue rate, and so on. Actually, the time spent in
memory accesses for the SPEC programs ranges from 20 % to 30 % on the Alpha 21164 system.
Nevertheless, the inadequacy of memory systems will widen the processor-memory performance
gap, and is clearly a serious problem for future processor-based computer systems.
Accordingly, we still need to make a great effort to improve memory-system performance
by developing efficient cache memories.
The most straightforward approach to improving the memory-system performance is to
increase the cache size. In order to alleviate the inadequacy of memory systems, the trend is to
invest the increasing transistor budget in cache capacity. Increasing the cache capacity reduces
the frequency of off-chip accesses by improving cache-hit rates. From an energy point of view,
this approach seems useful because the energy dissipated for driving external I/O pins
can be reduced. However, this approach also increases the energy dissipated in cache accesses.
Several studies have examined the power (the amount of energy consumed per unit time) of
caches. The power consumption of the on-chip caches of the StrongARM SA-110
accounts for 43 % of the total chip power [77]. In the 300 MHz bipolar CPU reported by Jouppi
et al. [44], 50 % of the power is dissipated by the caches. The rapidly growing mobile market
strongly requires not only high performance but also low energy dissipation. One of the
uncompromising requirements of portable computing is energy efficiency, because it directly
affects battery life. Therefore, from these studies, we believe that considering low-energy
cache architectures is worthwhile for future processor systems.
1.2 Contributions
Cache memories are indispensable for high-performance, low-energy processor chips. How-
ever, it is difficult to improve the performance/energy efficiency of cache memories by relying
only on advances in VLSI technology. As explained in Section 1.1, although increasing the
cache capacity improves cache-hit rates, it wastes a lot of energy per cache access. Moreover,
it also makes the cache-access time longer [96]. Therefore, if the benefit of the increased
cache-hit rate is smaller than the penalty caused by the cache-access-time and cache-access-energy
overheads, we cannot obtain any improvement in performance/energy efficiency.
There is another example for merged DRAM/logic LSIs. Eliminating the chip boundary
between the processor (with cache) and main memory makes it possible to exploit high on-chip
memory bandwidth. However, exploiting the maximum available memory bandwidth is not
always beneficial. Thrashing may occur between the cache and main memory
because of unnecessary data replacements, thereby wasting time and energy.
Since cache memories affect all memory references, we have to pay significant attention to
microarchitectures for improving performance/energy efficiency of cache memories. The goal
of this thesis is to propose and develop high-performance, low-energy cache architectures.
The role of the cache memory is to serve read and write requests from the processor as quickly
as possible. However, as conventional caches employ conservative mechanisms, there are
many unnecessary operations, and these unnecessary operations waste much energy and time. In
order to eliminate them, our caches are based on the following strategy:
1. Monitor memory-reference behavior at run-time.
2. Predict and detect unnecessary operations for future accesses by analyzing the monitored
memory-reference behavior at run-time.
3. Eliminate the unnecessary operations at run-time.
The key to our approach is to optimize the cache operation for the characteristics of the target
program at run-time. As our scheme does not require any modification of the instruction-set
architecture, full compatibility with existing object code is maintained.
Three major contributions of this thesis are described below:
• Way-predicting (WP) set-associative cache: A cache architecture for low energy
dissipation is proposed and evaluated. The cache attempts to eliminate unnecessary
way activation in set-associative caches. In conventional set-associative caches, all ways
are searched in parallel because the cache-access time is critical. The way-predicting
set-associative cache predicts which way has the data desired by the processor before
starting the cache access. The way prediction is performed based on memory-access
history. As the way-predicting set-associative cache maintains the cache-hit rate
of the conventional organization, no additional latency or energy overhead for next-level
memory accesses is incurred.
• History-based tag-comparison (HTC) cache: A cache architecture for low energy
dissipation is proposed and evaluated. The cache predicts whether the instructions to be
fetched are currently cache resident, and attempts to eliminate unnecessary tag comparisons.
In conventional caches, tag comparison has to be performed on every cache access
in order to test whether the memory reference hits the cache. Execution footprints
recorded in a BTB (branch target buffer) are used for the prediction. As the history-based
tag-comparison cache affects neither the cache-hit rate nor the cache-access time, the
memory-system performance of the conventional organization is maintained.
• Dynamically variable line-size (DVLS) cache: A cache architecture for high performance
and low energy on merged DRAM/logic LSIs is proposed and evaluated. The
cache predicts the degree of spatial locality, and attempts to avoid unnecessary data
replacements. For merged DRAM/logic LSIs with a memory hierarchy
Table 1.1: Characteristics of Proposed Cache Architectures.

Caches   What to Monitor                  What to Predict                                What to Eliminate
WP       MRU (Most-Recently-Used) ways    a way to be accessed                           unnecessary way activation
HTC      program execution sequence       whether or not the instruction to be fetched   unnecessary tag comparisons
                                          next already resides in the cache
DVLS     cache-line reference history     the degree of spatial locality                 unnecessary data replacements
including cache memory, we can exploit high on-chip memory bandwidth by means of
replacing a whole cache line at a time on cache misses. This approach tends to in-
crease the cache-line size if we attempt to improve the attainable memory bandwidth.
In general, large cache lines can benefit some programs through a prefetching effect.
Larger cache lines, however, might worsen system performance if programs do not
have enough spatial locality, because cache-conflict misses frequently take place. As
a result, the wide on-chip buses and DRAM array waste not only time but also a
great deal of energy because of the larger number of main-memory accesses. Although conflict
misses can be reduced by increasing the cache associativity, this approach usually makes the
cache-access time longer. In the dynamically variable line-size cache, the large cache
line is partitioned into multiple small cache lines (sublines), and the cache attempts
to adjust the number of sublines involved in each cache replacement. Namely, the
cache tries to optimize the cache-line size according to the degree of spatial locality
observed (a conceptual sketch follows this list). Reducing the cache-line size alleviates the
negative effects of large cache lines without cache-access-time overhead. In addition, selective
activation of the on-chip buses and DRAM subarrays corresponding to the replaced sublines
reduces energy dissipation for cache replacements.
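The following is a minimal conceptual sketch of this adjustment, written in Python for exposition only; the subline size, the selectable line sizes, and the simple usage-count rule are illustrative assumptions rather than the actual hardware policy evaluated in Chapter 5.

# Conceptual sketch of dynamically variable line-size (DVLS) replacement.
# The constants and the decision rule are illustrative assumptions.
SUBLINE = 32                    # bytes per subline (assumed)

def choose_line_size(referenced_subline_flags):
    """Pick the next replacement size from how many sublines of the
    evicted 128-byte line were actually referenced while resident."""
    used = sum(referenced_subline_flags)    # 0..4 sublines touched
    if used >= 3:
        return 128                          # rich spatial locality
    elif used == 2:
        return 64                           # moderate spatial locality
    else:
        return 32                           # poor locality: one subline

def sublines_to_replace(miss_address, line_size):
    """Subline addresses fetched on a miss; only the DRAM subarrays
    backing these sublines would be activated."""
    base = miss_address - (miss_address % line_size)
    return [base + i * SUBLINE for i in range(line_size // SUBLINE)]

For instance, sublines_to_replace(0x1234, 64) returns the two 32-byte subline addresses 0x1200 and 0x1220, so only two of the four DRAM subarrays would be driven for that replacement.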
Table 1.1, Table 1.2, and Table 1.3 summarize the characteristics, usability, and effects of
the proposed cache architectures, respectively.
Table 1.2: Usability of Proposed Cache Architectures.

         Instruction Cache               Data Cache
Caches   Direct-Map   Set-Associative    Direct-Map   Set-Associative
WP       –            √                  –            √
HTC      √            –                  –            –
DVLS     √            √                  √            √
Table 1.3: Effects of Proposed Cache Architectures.

            Cache Accesses           Cache-Miss    Main-Memory Accesses
Caches      Time        Energy       Rate          Time        Energy
WP *)       ↗ (5%)      ↘ (72%)      →             →           →
HTC **)     →           ↘ (30%)      →             →           →
DVLS ***)   →           →            ↘ (37%)       →           ↘ (52%)

*) Compared with a conventional four-way set-associative data cache for 124.m88ksim.
**) Compared with a conventional direct-mapped (DM) instruction cache for 107.mgrid.
***) Compared with a conventional DM data cache with 128-byte lines for MIX-IntFp.
1.3 Overview
This thesis introduces adaptive cache memory architectures for high performance and low
energy dissipation, and is organized as follows. Chapter 2 briefly explains the principle of
memory hierarchy to confirm the most important characteristics of memory-reference behav-
ior, and defines metrics to evaluate the performance/energy efficiency of memory systems.
In addition, Chapter 2 surveys high-performance techniques and low-energy techniques for
cache memories. The way-predicting set-associative cache architecture and the history-based
tag-comparison cache architecture for low energy consumption are introduced in Chapter
3 and Chapter 4, respectively. Chapter 5 presents the dynamically variable line-size cache
architecture. Finally, Chapter 6 concludes this thesis.
Chapter 2
Memory Systems Employing Cache
Memories
2.1 Principle of Memory Hierarchy
Total system performance suffers from the inadequacy of memory systems, as explained in Chapter
1. If we could employ an ideal memory system with infinite memory space in which any memory
access completes within one processor clock cycle, total system performance would
improve dramatically. However, this assumption is impracticable in real memory systems
due to the restricted hardware budget, the limits of process technology, and so on. Employing
a memory hierarchy is a well-known technique for making a real memory system approach the
ideal one.
There is a rule of thumb for program-execution behavior called the 90/10 Locality Rule: a program
executes about 90 % of its instructions in 10 % of its code. From this rule, we can see
that some portions of the program-address space are executed frequently. Thus, programs
exhibit locality as follows [28]:
• Temporal locality: If an item is referenced, it will tend to be referenced again soon.
• Spatial locality: If an item is referenced, nearby items will tend to be referenced soon.
The principle of memory hierarchy is based on the locality of memory references. There are
many levels in a memory hierarchy, and data replacements are performed between adjacent levels.
Upper levels are smaller, faster, and closer to the processor than lower levels. An upper level
holds a part of the memory space of the next-lower level. The processor first tries to obtain the
referenced data from the closest level, because that memory access can be completed faster.
If the processor cannot find the data at that level, the next-lower level is searched. When
the required data is found at the lower level, a copy of the data is stored in the
upper level. After that, accesses to the stored data can be completed at the upper level
until the data is evicted.
Here, we consider the locality of references again. If programs have rich locality of memory
references, almost all accesses can be completed at the upper levels of the memory hierarchy;
only when an access misses an upper level is the next-lower level searched. Usually, accesses
to the highest level, the level-1 cache, can be completed in one clock cycle of a high-speed
processor. Therefore, a real memory system can behave like the ideal memory system if
almost all memory accesses are confined to the level-1 cache.
2.2 Memory-Access Time and Energy Definitions
We consider a memory hierarchy which consists of a cache memory implemented with static
RAM (SRAM) and a main memory implemented with dynamic RAM (DRAM); the lowest level of
this hierarchy is thus level 2, the main memory. The cache-miss rate is the most popular metric of cache performance.
However, it is very important to consider not only the cache-miss rate but also cache-access
time. Since the cache-access time affects all load/store operations, it has a great impact
on total memory-system performance. In this thesis, we use average memory-access time
(AMAT ), which is the average latency per memory reference [28]. The average memory-
access time can be expressed by the following equations:
AMAT = T_{Cache} + CMR \times 2 \times T_{MainMemory},    (2.1)

T_{MainMemory} = T_{DRAMarray} + \frac{LineSize}{BandWidth},    (2.2)
where CMR is the cache-miss rate (note that the cache-hit rate is denoted CHR in this thesis).
The cache-access time, denoted TCache, is the latency for determining whether a memory access
hits the cache and for providing the referenced data to the processor on a cache hit. The main-
memory-access time, denoted TMainMemory, or miss penalty, is the latency of an access to
the main memory. On a cache miss, if the cache employs a write-back policy for cache-line
replacement, two main-memory accesses take place (one for write-back and one for refill)
in the worst case. The main-memory-access time (TMainMemory) consists of two factors: the
latency of an access to the DRAM array (TDRAMarray) and the latency for transferring a cache line
between the cache and the main memory (LineSize/BandWidth). LineSize and BandWidth are the
size of the cache line to be replaced and the memory bandwidth between the cache and the main memory,
respectively.
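As a purely illustrative example with assumed (not measured) parameter values, let TCache = 2 ns, CMR = 0.05, TDRAMarray = 40 ns, LineSize = 32 bytes, and BandWidth = 8 bytes/ns. Equation (2.2) then gives TMainMemory = 40 + 32/8 = 44 ns, and Equation (2.1) gives AMAT = 2 + 0.05 × 2 × 44 = 6.4 ns; the miss term dominates even at a 5 % miss rate.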
On the other hand, the total energy consumed for the execution of a program consists
of two parts: the energy dissipated in the CPU core and that dissipated in the memory hierarchy,
denoted EMemoryHierarchy. We assume that the total count of load/store instructions in the execution
of a program is constant. Therefore, EMemoryHierarchy depends only on the energy efficiency
of the memory system, and can be approximated by the following equation:
E_{MemoryHierarchy} = \sum_{i=1}^{N} E_{MA_i},    (2.3)
where N is the count of load/store instructions executed, and EMAi is the memory-access
energy dissipated by the memory system to serve the i-th memory-access operation. In
this thesis, we use average memory-access energy (AMAE), which is the average energy
dissipated per memory reference (i.e., EMemoryHierarchy = N × AMAE). AMAE can be
expressed by the following equations:
AMAE = E_{Cache} + CMR \times 2 \times E_{MainMemory},    (2.4)

E_{MainMemory} = E_{DRAMarray} + E_{DataTransfer},    (2.5)
where ECache denotes the cache-access energy, which is the average energy dissipation per cache
access, and EMainMemory denotes the main-memory-access energy, which is the average energy
dissipation per main-memory access. On a cache miss, two main-memory accesses, one for
write-back and one for refill, consume an energy of 2 × EMainMemory in the worst case. The
main-memory-access energy consists of two factors: the energy for accessing the DRAM array
(EDRAMarray) and that for transferring a cache line between the cache and the main memory
(EDataTransfer). Moreover, the cache-access energy (ECache) can be approximated by the
following equation [85]:
E_{Cache} = E_{Decode} + E_{SRAMarray},    (2.6)
where EDecode is the average energy consumed for decoding the memory address, and ESRAMarray
is the average energy consumed for accessing the SRAM array (tag memory and data memory), per cache
access. The energy model described in [85] includes the energy consumed for driving external
I/O pins; that energy is included in EDataTransfer in Equation (2.5). EDecode depends
on the switching activity of the memory addresses generated by the processor, and is negligible
compared to ESRAMarray. Previous papers reported that the energy consumption of the
address decoder is about three orders of magnitude smaller than that of the other components
[4], [54]. Therefore, we assume that the cache-access energy (ECache) is determined by the energy
consumed for accessing the SRAM array (ESRAMarray).
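The following short Python sketch shows how Equations (2.1) through (2.6) combine into the two metrics; all numerical parameters are assumed example values, not results from the evaluations in later chapters.

# Sketch: computing AMAT and AMAE from Equations (2.1)-(2.6).
# All numeric parameters below are illustrative assumptions.

def amat(t_cache, cmr, t_dram_array, line_size, bandwidth):
    """Average memory-access time, Equations (2.1) and (2.2).
    The factor 2 models the worst case of write-back plus refill."""
    t_main = t_dram_array + line_size / bandwidth
    return t_cache + cmr * 2 * t_main

def amae(e_decode, e_sram_array, cmr, e_dram_array, e_transfer):
    """Average memory-access energy, Equations (2.4)-(2.6)."""
    e_cache = e_decode + e_sram_array      # Equation (2.6)
    e_main = e_dram_array + e_transfer     # Equation (2.5)
    return e_cache + cmr * 2 * e_main      # Equation (2.4)

# Assumed example: a 2 ns / 0.8 nJ cache backed by a 40 ns / 10 nJ DRAM
# array over an 8 bytes/ns on-chip bus, with 32-byte lines and CMR = 0.05.
print(amat(t_cache=2.0, cmr=0.05, t_dram_array=40.0,
           line_size=32, bandwidth=8.0))        # ~6.4 (ns)
print(amae(e_decode=0.01, e_sram_array=0.8, cmr=0.05,
           e_dram_array=10.0, e_transfer=4.0))  # ~2.21 (nJ)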
There are many levels at which we can work to improve memory-system performance
and energy dissipation: the device level, circuit level, architecture level, algorithm level,
and so on. In the following sections, we briefly survey architectural techniques for high-performance,
low-energy memory systems. Before presenting these techniques, we describe conventional
cache architectures in Section 2.3. Then, high-performance techniques and energy-reduction
techniques are introduced in Section 2.4 and Section 2.5, respectively.
2.3 Conventional Cache Architectures
There are mainly two kinds of cache architectures: the direct-mapped cache and
the set-associative cache. Figure 2.1 depicts the conventional organization of a
direct-mapped cache and a two-way set-associative cache. The block of data which can be
replaced between the cache and the main memory is called a cache line (or line). A set
consists of the cache lines which have the same cache-index address. We can regard the direct-
mapped cache as a one-way set-associative cache. Each way consists of a tag subarray and
a data subarray for storing tags and cache lines, respectively. An n-way set-associative
cache works as follows [28] (a simplified behavioral sketch is given after the list):
1. As soon as an effective memory address is generated by the processor, the cache starts
to decode the memory address and determines the set to be searched.
2. The cache starts simultaneously to read both the tag and the cache line designated
by the cache-index address from each way. Then the tags are compared with the tag-
portion of the memory address in order to test whether at most one of the stored tags
matches it. All tag comparisons are performed in parallel because speed is critical.
Figure 2.1: Conventional Cache Architectures. ((a) Direct-Mapped Cache and (b) 2-Way Set-Associative Cache. Each way consists of a tag subarray and a data subarray; the address is split into tag, index, and offset fields, tag comparison produces the hit/miss signal, and a multiplexer selects the load/store data.)
3. If a match is found (i.e., on a cache hit), the cache provides the word data in the
associated cache line to the processor (for read). Otherwise, a cache-line replacement
takes place.
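The Python sketch below mirrors these three steps behaviorally; the geometry, the random victim choice, and the tag-only bookkeeping are simplifying assumptions, and neither timing nor energy is modeled.

# Behavioral sketch of an n-way set-associative lookup (steps 1-3 above).
import random

class SetAssociativeCache:
    def __init__(self, num_sets, ways, line_size):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        # tags[set][way] holds the stored tag, or None for an invalid line.
        self.tags = [[None] * ways for _ in range(num_sets)]

    def _decode(self, address):
        # Step 1: split the address into offset, index, and tag fields.
        block = address // self.line_size
        return block % self.num_sets, block // self.num_sets  # (index, tag)

    def access(self, address):
        index, tag = self._decode(address)
        # Step 2: compare the stored tags of every way in the selected set
        # (done in parallel in hardware, sequentially in this sketch).
        for way in range(self.ways):
            if self.tags[index][way] == tag:
                return "hit", way              # step 3: provide the word
        victim = random.randrange(self.ways)   # miss: replace a line
        self.tags[index][victim] = tag
        return "miss", victim

# Example: a 2-way cache with 128 sets of 32-byte lines (8 KB of data).
cache = SetAssociativeCache(num_sets=128, ways=2, line_size=32)
print(cache.access(0x1234))   # first touch of the line -> miss
print(cache.access(0x1238))   # same 32-byte line       -> hit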
Compared to direct-mapped caches (i.e., one-way set-associative caches), n-way set-associative
caches (n ≥ 2) usually can produce higher cache-hit rates (reduce CMR in Equation (2.1)),
because higher associativity reduces conflict misses. However, increasing the cache associa-
tivity (i.e., increasing n) suffers from the following drawbacks:
• The cache-access time (TCache in Equation (2.1)) tends to be larger because the n-way
set-associative cache incurs an additional delay for way selection [30], [96]. The way
selection has to be performed after tag-comparison results are available. Therefore, if
the delay for the tag comparison is larger than that for reading the cache-line data, the
cache-access-time overhead due to the way selection appears.
• The cache-access energy (ECache in Equation (2.4)) tends to be larger [4], [29]. Although
at most only one way has the data desired by the processor, all the ways are accessed in
parallel. Increasing the cache associativity decreases the total number of word-lines in
the SRAM array (the height of the SRAM array), as shown in Figure 2.1. However, the
total number of bit-lines to be activated is increased. As a result, activating the peripheral
circuits attached to each bit-line, for example the bit-line precharge circuits, sense
amplifiers, and so on, increases the cache-access energy.
2.4 High-Speed Memory-Access Techniques
As explained in Section 2.3, there is a trade-off between the cache-access time and the cache-
hit rate: fast access but a low hit rate for direct-mapped caches versus slow access but a high
hit rate for set-associative caches. From Equation (2.1), it can be understood that there are at
least three approaches to improving memory-system performance (i.e., to reducing the average
memory-access time), as follows.
• Reducing the cache-access time (TCache) while maintaining the cache-miss rate and the
miss penalty as much as possible.
• Reducing the cache-miss rate (CMR) while maintaining the cache-access time and the
miss penalty as much as possible.
• Reducing the miss penalty (TMainMemory) while maintaining the cache-access time and
the cache-miss rate as much as possible.
In this section, we introduce techniques to satisfy the above requirements for high-performance
memory systems. Section 2.4.1 and Section 2.4.2 show techniques to improve the cache-access
time and the cache-hit rate, respectively. Then, Section 2.4.3 focuses on how to reduce the
cache-miss penalty for cache-line replacements.
2.4.1 Making Cache Access Faster
The most significant disadvantage of set-associative caches is that they suffer from a longer access
time due to way selection. Since the way selection can be performed only after the tag-comparison
results are available, the critical path becomes long. The key to the techniques introduced in this
section is to complete the way selection as early as possible.
2.4.1.1 Speculative Way Selection: Exploiting Locality
There are two methods for finding the desired way in set-associative caches: parallel search and
sequential search. The parallel search examines all ways in parallel; thus, the delay of the
tag-comparison-based way selection makes the cache-access time longer. The sequential search
examines the ways one by one until it finds the desired way. Therefore, the way-selection overhead
can be eliminated if the first probe finds the desired way, in which case the cache-access
time is as fast as that of a direct-mapped cache. However, in the worst case, the sequential search
may require as many clock cycles as the associativity. Namely, the cache-access
time depends on how fast the cache can find the desired way.
Kessler et al. [52] proposed a set-associative MRU cache which uses hardware similar to
a direct-mapped cache. The MRU cache employs an MRU-order-based sequential search, with the
MRU information stored in a mapping table. Chang et al. [16] proposed another MRU
cache, which is employed in System/370, to improve the access time of parallel-search set-
associative caches. Chang et al. reported that for a 128 KB cache with 64-way associativity,
more than 80 % of all memory references hit the MRU region, even though its size is only
2 KB (128 KB / 64 ways). The MRU information for each set is used to select one way before
the tag comparison is completed. When a cache access is issued, the way designated by the
corresponding MRU information is selected. When the cache selects a wrong way, two cycles
are required because the remaining ways must be accessed. Kessler et al. also reported that
the MRU scheme achieves more than 30 % cache-access-time improvement over a conventional
four-way set-associative cache.
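A minimal Python sketch of this MRU-based speculative way selection follows; the one-entry-per-set table and the one-cycle/two-cycle accounting follow the description above, while everything else (including the restriction to cache hits) is an illustrative assumption.

# Sketch of MRU-based speculative way selection (cache hits only).
class MRUWayPredictor:
    def __init__(self, num_sets):
        self.mru = [0] * num_sets        # most-recently-used way per set

    def access(self, index, hit_way):
        """Return the cycles spent: 1 if the MRU prediction was right,
        2 if the remaining ways had to be probed afterwards."""
        predicted = self.mru[index]
        self.mru[index] = hit_way        # update the MRU history
        return 1 if predicted == hit_way else 2

# Repeated hits to the same way of a set cost a single cycle; switching
# to another way costs the extra probe cycle once.
p = MRUWayPredictor(num_sets=128)
print(p.access(17, hit_way=1))   # -> 2 (the table still pointed at way 0)
print(p.access(17, hit_way=1))   # -> 1 (the table now points at way 1)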
2.4.1.2 Speculative Way Selection: Partial Tag Comparison
Another approach to improving the access time of set-associative caches is to obtain the
tag-comparison results as soon as possible. If the control signals for the way selection are
available before cache-lines are completely read out, the cache-access-time overhead caused
by the way selection can be hidden.
Partial address matching, proposed by Liu [60], is one approach to reducing the tag-
comparison time. The cache has two memory arrays: the MD (Main Directory) and the PAD
(Partial Address Directory). The MD contains the complete tag information, whereas the PAD
contains only a part of the tag bits (e.g., 5 bits). First, the tag-comparison results from the PAD are
used for the way selection. The complete tag comparison on the MD is also performed in parallel,
but only to verify the partial-address tag comparison. If the cache detects a wrong
way selection, the incorrectly accessed data is canceled. The timing advantage of partial
address matching comes from the simpler comparators and the fast read of a small number of
bits. Liu reported that reading 5 partial-address bits from the cache directory can be almost
twice as fast as reading 18 full-address bits.
Juan et al. [45] proposed the difference-bit cache. The idea is based on the fact that
the two stored tags corresponding to a set have to differ in at least one bit. By using the
1-bit (difference-bit) comparison result, the way selection can be performed. Diff memory
is employed to record the position and value of the difference-bit in the tag. Note that the
difference-bit comparison can be used for the way selection, but not for testing the cache hit
or miss. In case of two-way set-associative caches, two tags are read in parallel. After that,
one of the tags is selected by using the difference-bit comparison result, and it is then compared with
the tag portion of the memory address in order to determine whether the memory access hits the cache.
The selection of the data to be provided to the processor is also performed using the
difference-bit comparison result, instead of the complete tag-comparison result.
2.4.2 Making Cache-Miss Rate Lower
Memory-access behavior varies within and among program executions. However, conventional
caches expect all memory references to have a high degree of temporal and spatial locality.
Thus, conventional organizations have fixed hardware parameters: cache size, associativity,
mapping function, replacement policy, cache-line size, and so on. Therefore, it is difficult
for conventional caches to follow the varying behavior of memory references. To improve
cache-hit rates, many researchers have proposed cache architectures that attempt to adapt
the cache parameters, dynamically or statically, to the varying memory-access behavior.
2.4.2.1 Making Good Use of Cache Space
Unfortunately, conventional caches have only one mapping function for data placement. The
mapping function determines which set the data designated by a memory address should
be placed in. In particular, a set in a direct-mapped cache can hold only one cache line.
Therefore, data items that compete for a set cause a large number of conflict misses. The
key of the techniques introduced in this section is to employ several mapping functions to make
good use of the limited cache space, thereby reducing conflict misses.
(1) Employing Different Mapping Functions
The direct-mapped hash-rehash cache proposed by Agarwal et al. [2] attempts to avoid
conflict misses by using two different mapping functions. Conflicting data can be located
in a different set. When a cache access is issued, the first mapping function, which is the
same as that of a conventional direct-mapped cache, is used to search the first entry. If the first search
finds a hit (a first hit), the cache behaves as a direct-mapped cache. Otherwise, the other
mapping function is used to search the second entry. Namely, the hash-rehash cache looks
like a two-way set-associative cache employing a sequential-search scheme. If both the first and
second searches miss (a cache miss), the missed data is filled into the second
entry, and the first and second entries are swapped to keep the MRU cache line
in the first location. The column-associative cache proposed by Agarwal et al. [1] has the same
configuration as the hash-rehash cache, except for a rehash bit in each set. The rehash bit
inhibits a rehash access in order to avoid secondary thrashing.
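The Python sketch below illustrates the hash-rehash lookup order and the swap on a rehash hit; the specific second hash function (XORing high address bits into the index) is only an assumed placeholder, not the function used in [2].

# Behavioral sketch of a direct-mapped hash-rehash lookup.
NUM_SETS = 256

def hash1(block):
    return block % NUM_SETS

def hash2(block):
    # Assumed placeholder rehash function, not the one from [2].
    return (block ^ (block // NUM_SETS)) % NUM_SETS

def access(cache, block):
    """cache maps a set index to the stored block number."""
    first, second = hash1(block), hash2(block)
    if cache.get(first) == block:                  # first probe
        return "first hit"
    if cache.get(second) == block:                 # rehash probe
        # Swap so the MRU block sits in its first-hash location.
        cache[first], cache[second] = cache.get(second), cache.get(first)
        return "rehash hit"
    cache[second] = block                          # miss: fill second entry
    return "miss"

cache = {}
print(access(cache, 0x140))   # -> miss (filled into its second entry)
print(access(cache, 0x140))   # -> rehash hit (then swapped to first entry)
print(access(cache, 0x140))   # -> first hit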
The hash-rehash cache and the column-associative cache produce worse cache-miss rates than con-
ventional two-way set-associative caches with an LRU replacement strategy, because the mechanism
for the hash and rehash operations cannot implement true LRU replacement. Calder
et al. [14] proposed the predictive sequential associative cache, which has a steering-bit table in
order to indicate which entry has to be searched first. In addition, the MRU information is
used to implement a complete LRU replacement strategy. Calder et al. also proposed predicting
the cache index in an earlier pipeline stage by using prediction sources such as register contents,
offsets, register numbers, and so on, in order to hide the steering-bit-table access penalty.
The skewed-associative cache proposed by Seznec [80] improves the hit rate of 2-way set-
associative caches. A 2-way skewed-associative cache has the same configuration as a 2-way
conventional set-associative cache (i.e., there are two memory subarrays), but it has two
mapping functions. The different mapping functions operate on different memory subarrays in
parallel. As the two mapping functions are more complex than those of the hash-rehash
cache, the skewed-associative cache can achieve higher cache-hit rates. Seznec reported that
the cache-miss rate produced by a 2-way skewed-associative cache is comparable with that
achieved by a 4-way conventional set-associative cache.
Other related studies include the sequential multi-column cache and the parallel multi-column cache [99].
(2) Employing an Adaptive Mapping Function
The adaptive group-associative cache proposed by Peir et al. [74] attempts to use
the cache space intelligently. In conventional caches, a number of empty frames, or holes, exist in the
cache. The authors measured the average percentage of holes in various cache configurations
during the execution of a program, and observed that between 37.5 % and 42.6 % of
the cache consists of holes. In fact, these holes will be filled by rarely reused data. The idea of the
adaptive group-associative cache is to identify the existing holes and allocate them to
frequently reused data. On a cache miss, frequently reused data that is about to be evicted from the cache
is moved into a hole instead of to the main memory. In other words, the cache optimizes
the mapping function by detecting the holes at run-time. Peir et al. reported that the cache-
miss rate produced by an adaptive group-associative cache is comparable with that of a
fully associative conventional cache for some workloads.
2.4.2.2 Inhibiting Rarely Reused Data from Polluting Cache Space
As conventional caches load every data item into the cache regardless of its reuse behavior, rarely
reused data pollutes the limited cache space. Cache bypassing is one approach to solving
this problem: missed data with poor temporal locality is provided directly from the
main memory to the processor without being loaded into the cache.
Johnson et al. [40] proposed a run-time adaptive cache-management scheme to improve cache-
hit rates. Their cache employs a memory address table (MAT), in which the memory-reference
behavior is recorded at run-time. Each entry in the MAT contains a counter used to
identify the amount of temporal locality of the corresponding memory block. When the
value of the counter is smaller than a threshold, the referenced data in the corresponding
memory block bypasses the cache.
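A minimal sketch of such counter-based bypassing is shown below; the tracking granularity, the threshold, and the update rule are assumptions made for illustration and are not taken from [40].

# Sketch of counter-based cache bypassing in the spirit of a memory
# address table (MAT). Granularity, threshold, and update rule assumed.
from collections import defaultdict

BLOCK_SIZE = 4096      # bytes tracked per table entry (assumed)
THRESHOLD = 2          # below this reuse count, misses bypass the cache
counters = defaultdict(int)

def should_bypass(miss_address):
    block = miss_address // BLOCK_SIZE
    bypass = counters[block] < THRESHOLD   # little temporal locality so far
    counters[block] += 1                   # record that the block was touched
    return bypass

# The first misses to a block are served around the cache; once the block
# has shown reuse, later misses are allowed to allocate in the cache.
print(should_bypass(0x12345))   # -> True
print(should_bypass(0x12345))   # -> True
print(should_bypass(0x12345))   # -> False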
McFarling [61] proposed a dynamic exclusion replacement policy in order to reduce the num-
ber of conflict misses in direct-mapped instruction caches. The cache presented in the paper
measures reference patterns. When two instructions compete for the same cache line, the
dynamic exclusion approach attempts to prohibit loading one instruction into the cache, so
that the other instruction can be kept in the cache. For the dynamic exclusion control, a
simple finite-state machine for each cache line is used. McFarling also proposed an instruc-
tion reordering technique based on compiler optimization in order to exclude less frequently
executed instructions.
Another approach to avoiding cache pollution is to secure a part of the cache space for
frequently reused data. Scratch-pad memory has been proposed to realize this kind of memory
management. The scratch-pad memory holds a part of the main-memory space and is
located at level 1 of the memory hierarchy. An access to the scratch-pad memory can be completed
in one clock cycle, the same as the level-1 cache. The difference between the scratch-pad memory
and the level-1 cache is that no data replacement takes place in the scratch-pad memory.
Namely, a 100 % hit rate for scratch-pad-memory accesses is guaranteed, whereas the level-
1 cache-hit rate depends on compulsory, capacity, and conflict misses. Which data should
be allocated to the scratch-pad memory space has to be determined before the program is
executed. Panda et al. [69] presented a technique for exploiting the scratch-pad
memory effectively: a careful partitioning of scalar and array variables between the main memory and the
scratch-pad memory improves the memory performance. Chiou et al. [17] proposed a column
caching strategy that allows data replacement to be restricted at a column, or way, granularity.
Cache-line replacement in a restricted column is prohibited, so that missed data
which maps to the restricted column bypasses the cache. We can regard the restricted
column as a scratch-pad memory. A software-controllable bit-vector specifies the replacement
restriction. Nakamura et al. [65] also proposed a software technique, called SCIMA: Software
Controlled Integrated Memory Architecture, for high performance computing. An on-chip
memory space is divided into two portions: a cache and an on-chip memory. The cache is under
hardware control, as in conventional caches, while data replacement in the on-chip memory
is controlled by software; that is, the on-chip memory works as a scratch-pad memory.
Since the cache and the on-chip memory share the same hardware memory structure, software can
attempt to change and optimize the ratio of their sizes.
2.4.2.3 Exploiting Different-Characteristics Memories
Researchers have proposed many cache architectures consisting of several memory modules
in order to improve cache-hit rates. The memory modules are used for different purposes
so as to follow the varying behavior of memory references.
(1) Keeping and Filtering: Attaching a High-Associative Cache
There are many approaches that employ a small, highly associative cache. The roles of the
attached set-associative cache are 1) to keep frequently reused data close to the level-1
cache instead of in the next-level memory, and 2) to filter out rarely reused data that would pollute the
cache. If a data item has rich temporal locality, it should not be evicted from the cache. In
contrast, a data item having poor temporal locality should not be loaded into the cache.
Jouppi [43] proposed the victim cache, which is a small fully associative cache located
between the direct-mapped level-1 cache (main cache) and the next-level memory (main
memory). When a cache line in the main cache is evicted, it is moved to the victim cache.
In the case of a miss in the main cache that hits in the victim cache, the cache lines are swapped
between the main cache and the victim cache. Namely, the victim cache attempts to keep
data that has been evicted from the main cache, but that probably has rich temporal locality,
close to the processor.
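A behavioral Python sketch of the main-cache/victim-cache interaction follows; the sizes, the FIFO eviction from the victim cache, and the tag-only bookkeeping are assumptions for illustration.

# Sketch of a direct-mapped main cache backed by a small fully
# associative victim cache (four entries, FIFO eviction assumed).
from collections import deque

NUM_SETS = 256
main = [None] * NUM_SETS           # main[index] = stored block, or None
victim = deque(maxlen=4)

def access(block):
    index = block % NUM_SETS
    if main[index] == block:
        return "main hit"
    if block in victim:
        # Swap: the requested block returns to the main cache, and the
        # block it displaces moves into the victim cache.
        victim.remove(block)
        if main[index] is not None:
            victim.append(main[index])
        main[index] = block
        return "victim hit"
    if main[index] is not None:    # miss: evicted line goes to the victim
        victim.append(main[index])
    main[index] = block
    return "miss"

# Two blocks that conflict in the main cache ping-pong between the main
# cache and the victim cache instead of causing repeated off-chip misses.
print(access(0x100), access(0x200), access(0x100), access(0x200))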
Theobald et al. [86] discussed the design space of hybrid-access caches (the combination of
a direct-mapped main cache and a set-associative cache like the victim cache), and proposed
the half-and-half cache. The access time to the direct-mapped main-cache is faster than
that to the attached small set-associative cache. For example, the main-cache access can be
completed in one cycle, while the associative cache requires two cycles: one for normal access
and one for swapping between the main cache and the set-associative cache. Thus, there is
a trade-off between the cache-access time and the cache-hit rate when we consider how to
distribute the cache capacity between the direct-mapped region and the set-associative region. Although
increasing the direct-mapped region (and thus decreasing the set-associative region) increases conflict
misses, it may improve the average cache-access time by increasing the number of hits to
the direct-mapped region. The half-and-half cache uses half of the total cache capacity for
the direct-mapped region and the remaining half for the set-associative region.
In contrast to the victim cache and the half-and-half cache, the annex cache proposed by John
et al. [39] and the pollution control cache proposed by Walsh et al. [93] attempt to filter
the data to be loaded into the main cache. Both the annex cache and the pollution control
cache are small high-associative caches attached to the main cache. On a cache miss, the
missed data is loaded into the small associative cache, instead of the main cache. Then the
cache lines in the main cache and the small associative cache are swapped when the filled
data in the small associative cache is referenced again. Therefore, data that is never reused
is evicted from the small associative cache without ever being loaded into the main cache.
(2) Exploiting Different Types of Locality
Spatial locality can be exploited by increasing the cache-line size. On the other hand,
decreasing the cache-line size is a good approach to exploiting temporal locality, because the
total number of entries, or cache lines, in the cache is increased. Unfortunately, conventional
caches have a fixed cache-line size, so it is impossible to satisfy both of the above-mentioned
requirements. The most straightforward approach to solving this problem is to employ two
types of caches: one with a small cache-line size and the other with a large cache-line size.
The dual data cache proposed by Gonzalez et al. [23] consists of two memory modules: a spatial
cache and a temporal cache. These caches have the same organization, but their cache-line sizes
are different: the spatial cache has a larger cache-line size, whereas the temporal cache has a
smaller one. A locality prediction table in the dual data cache determines where
the missed data should be loaded. Each entry in the table corresponds to a recently executed
load/store instruction. In contrast to this dynamic optimization, Sanchez et al. [76] discussed a
static locality analysis for the dual data cache.
Park [71] proposed the co-operative cache which consists of the spatial-oriented cache (SOC)
having a larger cache-line size and the temporal-oriented cache (TOC) having a smaller cache-
line size. Another example is the split temporal/spatial cache proposed by Milutinovic et al.
[62] which also has a spatial cache having a usual cache-line size and a temporal cache having
a small cache-line size.
(3) Prohibiting Non-Critical Data from Polluting Cache Space
So far, we have introduced many techniques to improve cache-hit rates. However, improving
the cache-hit rate does not necessarily translate into an advantage in total system performance.
When we consider the total execution time of a program, the most important thing is to
reduce the total number of clock cycles required. From the memory-system point of view, we
need to consider the total number of processor stalls caused by the real memory system.
Recent processors exploit increased instruction level parallelism (ILP), thereby achieving
higher performance. In other words, lack of ILP degrades the total processor performance.
The cache-hit rate may not be an appropriate metric for evaluating total memory-system
performance, because it does not capture how much each load/store operation affects the
total number of processor stalls. Processor stalls depend on the data dependences in the
program, so some cache misses that affect those dependences are more critical than others.
In addition, if there are enough instructions which can be issued, a cache miss might not
affect the ILP. Actually, Srinivasan et al. [83] showed that not all data accesses need to occur
immediately if there are enough ready instructions for the processor to execute.
The Non-Critical Buffer proposed by Fisk et al. [21] is a small associative buffer (for example,
16 entries) which works in parallel with the level-1 data cache (main cache). The Non-Critical
Buffer is used to prevent non-critical data from polluting the cache space. As a result, the large
main cache can be reserved for critical data, whose misses significantly damage processor
performance (i.e., ILP). Two mechanisms to identify the non-critical data at run-time were
performance (i.e., ILP). Two mechanisms to identify the non-critical data at run-time were
proposed. One of the mechanisms tracks the processor performance by monitoring issue rate
or functional unit usage, and the other mechanism uses the Load/Store Queue (LSQ). Fisk
et al. reported that the non-critical buffer can achieve processor performance improvements
even if it worsens the cache-hit rates.
2.4.2.4 Data Prefetching by Larger Cache-Line Sizes
If we can perform perfect prefetching, the ideal memory system can be realized because all
main-memory accesses are overlapped with other computations. Increasing cache-line size
is one method of performing data prefetching. If memory references have rich spatial
locality, larger cache-line sizes give a prefetching effect. However, the following drawbacks
prevent cache designers from increasing the cache-line size.
• Increase in conflict misses: increasing the cache-line size results in reducing the total
number of cache lines which can be held in the cache. Thus, large cache-line sizes
increase conflict misses when programs have poor spatial locality (increase in CMR in
Equation (2.1)), thereby degrading the memory system performance.
• Increase in the memory bandwidth requirement: On cache misses, large cache lines need
to be replaced between the cache and the main memory. Therefore, increasing cache-
line size increases the memory bandwidth required (increase in LineSize in Equation
(2.2)). However, the I/O pin bottleneck between the cache and the main memory in
conventional systems limits the attainable memory bandwidth. As a result, increasing
the cache-line size increases miss penalty (increase in TMainMemory in Equation (2.1)),
thereby degrading the memory system performance.
For the cache-line-size optimization, the following processes are required: 1) detect the
amount of spatial locality inherent in programs, and 2) modify the cache-line size. We can
consider the following approaches to detecting the locality and to modifying the cache-line
size:
• Hardware detection and hardware modification: The amount of spatial locality is mea-
sured at run-time. In this method, a mechanism to record memory-reference history
will be required. The cache-line size is modified according to the memory-reference
history [20], [41], [37], [92], [57].
• Software detection and software modification: The amount of spatial locality is analyzed
at compile-time. In this method, loop structures inherent in programs will be exploited.
Special instructions are inserted in program codes by compiler in order to modify the
cache-line size [91], [92].
• Hardware detection and software modification: The amount of spatial locality is mea-
sured at run-time. A compiler inserts special instructions in program codes. The special
instruction says “if a condition is satisfied, then increase (or decrease) the cache-line
size”. The run-time measurement is exploited to determine the condition of the special
instruction for modifying the cache-line size.
The details of the line-size optimization techniques are discussed in Chapter 5; a minimal sketch of the hardware-detection approach is given below.
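As a concrete illustration of the hardware-detection step, the following Python sketch tracks which words of each fetched cache line are actually touched before the line is evicted, and derives a crude grow/shrink suggestion for the line size. The class names, thresholds, and update policy are illustrative assumptions for this sketch only, not the mechanisms of the proposals cited above.

# Sketch: estimate spatial locality by tracking which words of each fetched
# cache line are touched before the line is evicted (illustrative only).

class LineUsageMonitor:
    def __init__(self, line_size_words=8):
        self.line_size_words = line_size_words
        self.used = {}                 # line address -> set of touched word offsets

    def on_fill(self, line_addr):
        self.used[line_addr] = set()

    def on_access(self, line_addr, word_offset):
        if line_addr in self.used:
            self.used[line_addr].add(word_offset)

    def on_evict(self, line_addr):
        touched = len(self.used.pop(line_addr, ()))
        return touched / self.line_size_words     # fraction of the line actually used

def suggest_line_size(avg_used_fraction, current_words,
                      grow_threshold=0.75, shrink_threshold=0.25):
    """Grow the line when most words are used, shrink it when few are (assumed thresholds)."""
    if avg_used_fraction > grow_threshold:
        return current_words * 2
    if avg_used_fraction < shrink_threshold:
        return max(1, current_words // 2)
    return current_words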
2.4.2.5 Optimizing Data Placement
Conflict misses take place when two data compete for a cache location. If we can re-allocate
the address of one of the competing data, the conflict miss can be avoided. Data-placement
optimization is a static approach to reducing conflict misses [87], [68], [33], [47].
2.4.3 Making Cache-Miss Penalty Smaller
The final approach to the ideal memory system is to reduce the miss penalty which is wasted
on cache misses. As shown in Equation (2.2), there are at least three approaches to minimizing
the miss penalty: 1) improving the DRAM access time, 2) reducing the cache-line size, and
3) increasing the memory bandwidth. The DRAM access time can be improved by advanced
process technology; however, process-level optimization is outside the scope of this thesis.
Conventional caches exploit the spatial locality by employing larger cache-line size. There-
fore, there is a trade-off between improving the cache-hit rate and reducing the miss penalty.
Although increasing cache-line size improves cache-hit rates due to the effect of prefetching
(decrease in CMR in Equation (2.1)), it also increases the memory bandwidth requirement for
cache-line replacements (increase in TMainMemory in Equation (2.1)). Adapting the cache-line
size introduced in Section 2.4.2.4 attempts to find appropriate trade-off points. Employing
a cache-bypass mechanism is also a good approach to reducing the memory bandwidth requirement.
The idea of Tyson et al. [90] is based on the fact that almost all cache misses are caused
by a small number of instructions called troublesome instructions; they reported that less
than 5% of the total load instructions are responsible for over 99% of all cache misses.
When a troublesome instruction causes a cache miss, the referenced data bypasses the cache
instead of being loaded into it. In this case, only the data requested by the troublesome
instruction is transferred from the main memory to the processor.
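A minimal sketch of this bypassing idea is given below: it marks a load instruction as troublesome once it has caused a given number of misses, after which its accesses bypass the cache. The threshold and the per-PC bookkeeping are illustrative assumptions, not the exact mechanism of Tyson et al. [90].

# Sketch: marking "troublesome" load instructions that cause many misses
# and bypassing the cache for them (threshold is an illustrative assumption).

from collections import defaultdict

class BypassPredictor:
    def __init__(self, miss_threshold=8):
        self.miss_count = defaultdict(int)   # load PC -> observed cache misses
        self.threshold = miss_threshold

    def should_bypass(self, load_pc):
        return self.miss_count[load_pc] >= self.threshold

    def record_miss(self, load_pc):
        self.miss_count[load_pc] += 1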
Improving the memory bandwidth can be achieved by integrating the cache and the main
memory into the same chip, i.e., a merged DRAM/logic LSI. Eliminating the chip boundary
between the cache and the main memory solves the I/O-pin bottleneck problem, thereby
dramatically improving the memory bandwidth [64], [78], [72], [73].
2.5 Low-Energy Memory-Access Techniques
From Equation (2.4), it can be understood that there are at least three approaches to reducing
the average memory-access energy as follows:
• Reducing the cache-access energy (ECache), while maintaining the cache-miss rate and
the main-memory-access energy as much as possible.
• Reducing the cache-miss rate (CMR), while maintaining the cache-access energy and
the main-memory-access energy as much as possible.
• Reducing the main-memory-access energy (EMainMemory), while maintaining the cache-
access energy and the cache-miss rate as much as possible.
The techniques introduced in Section 2.4.2 for reducing conflict misses can be used for the
second approach [4]. In the following sections, we focus on the first and the third approaches.
Energy dissipation in CMOS circuits is mainly due to the charging and discharging of load
capacitances. While a cache access is performed, the following energy is dissipated:

ESRAMarray = 0.5 × C × Vdd²,   (2.7)

where Vdd is the supply voltage as well as the output voltage swing, and C is the total switched
load capacitance of all cache components (bit-lines, word-lines, memory cells, and so on). It
can be understood from Equation (2.7) that we can reduce the energy dissipation by reducing
C or Vdd. Reducing the supply voltage (Vdd) has a great impact on the energy dissipation,
because Equation (2.7) is a function of the square of the supply voltage. However, it also
makes the access time longer [15], [66], so we do not consider this approach in this thesis.
In the following sections, we introduce energy reduction techniques for cache and main-
memory accesses by reducing the switched load capacitance (C). Section 2.5.1 shows tech-
niques to reduce the cache-access energy: structural approach and behavioral approach.
Section 2.5.2 presents energy reduction techniques for main-memory accesses. DRAM (main
memory) consumes static energy not for main-memory accesses but for refresh operations.
Although the static energy is not included in Equation (2.4), some techniques to reduce that
energy are introduced in Section 2.5.3.
2.5.1 Reducing Cache-Access Energy
Basically, the energy dissipation for a memory-array access depends on the array size (or
the number of words held in the memory array) [38]. Let us consider the case where a 32-bit
word needs to be read from a 16-Kbit (128 × 128 bit cells) cache space. Here, we refer to the
fraction of the cache space that needs to be activated for a cache access as the activated-area.
In addition, we refer to CBLcell and CWLcell as the switched load capacitances associated with
a single bit cell on a bit-line and on a word-line, respectively. In order to simplify the explanation
of the activated-area, in this section we assume that the cache-access energy is determined
only by the charging/discharging of bit-lines and word-lines. In fact, other circuits, for example
bit-line precharging circuits, also dissipate some energy. In the case of the original memory-array
organization, the activated-area is equal to the whole cache space. Thus, the switched load
capacitance in Equation (2.7) is 16384 × CBLcell + 128 × CWLcell. However, if the memory array
is divided into four modules (128 × 32 bit cells × 4), and if it is possible to activate only
one module, the activated-area becomes a quarter of the whole cache. In this case, the total
switched load capacitance is 4096 × CBLcell + 32 × CWLcell, thereby saving energy.
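The arithmetic above follows directly from Equation (2.7); the short Python sketch below reproduces it. The per-cell capacitances and the supply voltage are placeholder values (only the ratio between the two organizations matters), not figures taken from any particular design.

# Sketch of Equation (2.7) applied to the 128 x 128-bit array example above.
# CBL_CELL, CWL_CELL, and VDD are placeholder values, not design data.

CBL_CELL = 1.0e-15   # bit-line capacitance per cell (illustrative)
CWL_CELL = 1.0e-15   # word-line capacitance per cell (illustrative)
VDD = 2.5            # supply voltage in volts (illustrative)

def sram_access_energy(switched_capacitance, vdd=VDD):
    """Equation (2.7): E = 0.5 * C * Vdd^2."""
    return 0.5 * switched_capacitance * vdd ** 2

# Non-divided array: 128 x 128 cells on the bit-lines, 128 cells on the word-line.
c_whole = 128 * 128 * CBL_CELL + 128 * CWL_CELL

# Divided into four 128 x 32 modules with only one module activated.
c_quarter = 128 * 32 * CBL_CELL + 32 * CWL_CELL

print(sram_access_energy(c_quarter) / sram_access_energy(c_whole))  # 0.25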
The key idea of the techniques introduced in this section is to make the activated-area small, and the
following two steps are required:
1. Module Partitioning: Divide the cache into at least two modules, or attach at least one
small cache module. As a result, small areas, which are candidates of the activated-area,
are generated.
2. Selective Activation: Activate only one small area for performing the cache access.
We classify energy reduction techniques for cache memories into two approaches: a structural
one and a behavioral one. The structural approach changes the memory organization (the cache
is divided), but the cache-access operation is not modified. In contrast, the behavioral approach
attempts to optimize the cache-access operation for low energy dissipation while the memory
organization is maintained (such caches originally have a multi-module organization).
2.5.1.1 Structural Approaches
(1) Horizontal Partitioning
The techniques introduced in this section partition the cache module horizontally. A well-known
technique for memory-array partitioning is word-line partitioning [75]. In conventional
memory arrays, a large number of transfer gates are connected to each word-line. Word-line
partitioning reduces the total number of memory cells connected to a word-line.
Cache subbanking [85], [48] is a horizontal partitioning scheme for low-energy caches. Usu-
ally, a cache line includes several words (for example 8 words) in order to exploit the spatial
locality of memory references. In conventional caches, the referenced data is selected from the
corresponding cache line, which is read from the data memory; thus, the remaining contents of
the cache line are unused. In cache subbanking, the data memory is partitioned horizontally
into subarrays, and only the subarray designated by the offset field of the memory address is
activated.
Region-Based Caching proposed by Lee et al. [59] is another implementation of horizontal
partitioning. The region-based caching exploits the different characteristics of data type, and
consists of three cache modules: a small module for stack data, a small module for global
data, and a larger main module for other data. For example, a 4 KB direct-mapped stack-cache, a
4 KB direct-mapped global-cache, and a 32 KB direct-mapped main-cache are implemented.
Which module has to be searched is determined from the memory address. When
only the stack-cache or the global-cache is activated, the energy dissipated for the cache access
can be reduced due to the small value of C in Equation (2.7), compared with a conventional cache
organization. Lee et al. reported that about 70 % of memory references hit the stack-cache
or the global-cache.
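To illustrate how the module to be activated can be chosen from the memory address alone, the sketch below selects among the stack-cache, global-cache, and main-cache. The address ranges are illustrative assumptions for this sketch and do not correspond to any particular system described in [59].

# Sketch: selecting the cache module from the memory address in a
# region-based cache.  The address ranges are assumptions for illustration.

STACK_BASE  = 0x7FF0_0000   # assumed stack region
GLOBAL_BASE = 0x1000_0000   # assumed global/static data region
GLOBAL_TOP  = 0x2000_0000

def select_module(addr):
    """Return which cache module should be activated for this access."""
    if addr >= STACK_BASE:
        return "stack-cache"      # small module, low access energy
    if GLOBAL_BASE <= addr < GLOBAL_TOP:
        return "global-cache"     # small module, low access energy
    return "main-cache"           # larger module, higher access energy

print(select_module(0x7FFF_FF10))  # stack-cache
print(select_module(0x1234_5678))  # global-cache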
(2) Vertical Partitioning
In contrast to word-line partitioning, bit-line partitioning is another well-known technique
[75]. Ghose et al. [22] evaluated the effects of bit-line partitioning for caches employed
by superscalar processors.
Employing a level-1 cache reduces the energy consumed for memory accesses because of
the small activated-area (i.e., accessing not to the large main memory but to the small level-1
cache). Similarly, adding a small level-0 cache between the level-1 cache and the processor
can make a significant energy reduction.
Su et al. [85] and Kamble et al. [48] evaluated the energy efficiency of cache-line buffering,
or block buffering, which uses a single-entry buffer. The previously accessed cache line is held
in the cache-line buffer. When a memory access is issued, the cache-line buffer is searched
first. If the memory access hits the cache-line buffer, the desired data is provided from the
cache-line buffer to the processor. In this case, the activated-area is the small cache-line
buffer instead of the main cache. When memory references have rich temporal (and spatial)
locality, the buffer-hit rate is high, thereby saving more energy. Ghose et al. [22]
proposed the multiple line buffer for superscalar processors, which has several (for
example, four) entries. Kin et al. [54] proposed the filter cache, in which a level-1 cache access occurs
only on a filter-cache miss. Kin et al. reported that the filter cache reduces the energy-delay
product by 51% across a set of multimedia and communication applications compared with a
conventional cache organization. Another study of this kind of approach was presented by
Bajwa et al. [5], in which a small level-0 cache is employed to reduce the energy consumed by
an instruction cache.
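The following sketch illustrates the vertical-partitioning idea discussed above: a single-entry cache-line buffer is checked first, and the level-1 cache is activated only on a buffer miss. The relative energy values are placeholders chosen only to show the effect of the smaller activated-area.

# Sketch: a single-entry cache-line buffer placed in front of the level-1
# cache.  Only the buffer is activated on a buffer hit; energies are placeholders.

LINE_SIZE = 32          # bytes
E_BUFFER  = 0.05        # relative energy of a line-buffer access (illustrative)
E_L1      = 1.00        # relative energy of a level-1 cache access (illustrative)

class LineBuffer:
    def __init__(self):
        self.tag = None     # address of the currently buffered line
        self.data = None

    def access(self, addr, l1_read):
        """l1_read is a callable that returns the line from the level-1 cache."""
        line_addr = addr // LINE_SIZE
        if line_addr == self.tag:              # buffer hit: small activated-area
            return self.data, E_BUFFER
        self.tag = line_addr                   # buffer miss: activate level-1 cache
        self.data = l1_read(line_addr)
        return self.data, E_BUFFER + E_L1

buf = LineBuffer()
print(buf.access(0x100, lambda line: "line-%d" % line))  # miss, then hit below
print(buf.access(0x104, lambda line: "line-%d" % line))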
The effectiveness of vertical partitioning depends largely on how much the memory accesses
can be concentrated on the small level-0 cache. Bellas et al. [6] proposed a dynamic cache-management
scheme that allocates the most frequently executed instruction blocks to the small level-0
cache. A branch-prediction unit is exploited to detect the frequently executed blocks.
(3) Horizontal and Vertical Partitioning
Ko et al. [56] proposed the MDM (multi-divided module) cache architecture. The cache
is divided horizontally and vertically into small modules. Each small module includes its own
peripheral circuits, so that it can operate as a stand-alone cache. Only the single small module
designated by the memory address is activated. When the MDM cache has M independently
selectable modules, the switched load capacitance per access becomes almost 1/M of that of
a non-divided conventional organization.
(4) Static and Dynamic Regions Partitioning
Adding a small level-0 cache between the level-1 cache and the processor can be seen as an
extension of the memory hierarchy, because data replacements between the level-0 and level-1
caches are required on level-0 cache misses. Another approach to reducing cache-access energy
is to partition the cache module into a small static module and a large dynamic module. Data
allocation to the static module is determined by prior program analysis, and the static module
works as the scratch-pad memory explained in Section 2.4.2.2, whereas the dynamic module
behaves as a normal cache. If we can concentrate memory accesses on the small static module,
a large amount of energy can be saved due to the small value of C in Equation (2.7).
Many techniques for data allocation to the static module have been proposed. These
techniques are based on the profile data from the execution of programs. Panwar et al. [70]
proposed S-cache. Frequently executed basic-blocks are placed in the S-cache (a small static
module). Jump instructions which control the execution flow between the S-cache and the
level-1 main cache are inserted in program codes. Although the S-cache has been proposed
for reducing the frequency of tag comparisons, it can also be used to reduce the energy consumed
for cache-line accesses. Bellas et al. [7], [8] proposed the loop cache (L-cache) for instruction
caches. The compiler lays out the target program to maximize the number of accesses to the
L-cache, and inserts special instructions to identify the boundary between the code placed in
the L-cache and the code that is not.
Increasing the static-module size increases the static-module hit rate. However, it also
increases the energy dissipation for an access to the static module. Ishihara et al. [38]
discussed the trade-off between the size of the main level-1 cache and that of the static
module. Kawabe et al. [50] presented an implementation example of the static and dynamic
region partitioning. Of course, the scratch-pad memory introduced in Section 2.4.2.2 can be
employed for this kind of low-energy technique.
2.5.1.2 Behavioral Approaches
As explained in Section 2.3, all ways in set-associative caches are searched in parallel because
the cache-access time is critical. Thus, the energy for a tag-subarray access and that for a
data-subarray access are dissipated in every way. However, since only one way has the data
desired by the processor on a cache hit, conventional set-associative caches waste a lot of
energy. Some techniques have been proposed to alleviate this negative effect of set-associative
caches by optimizing the cache-access behavior.
(1) Selective Way Activation
The activated-area in conventional set-associative caches includes all ways. However, all
way accesses but one are unnecessary. One approach to achieving energy reduction for
set-associative caches is to restrict the activated-area to the single way that includes the
desired data.
The Hitachi SH microprocessor employs a phased cache in order to avoid unnecessary data-
subarray accesses [26]. In the phased cache, tag comparison and cache-line access are per-
formed sequentially. First, tag comparisons are performed without data-subarray activation.
Then, only a single data-subarray which includes the desired data is accessed if at most one
tag matches. Otherwise, a cache-line replacement is performed without any data-subarray
access. Although this approach reduces the energy consumed for data-subarray accesses
(cache-line accesses), the cache-access time will be increased due to the sequential flow. If we
know which way includes the desired data before starting the cache access (i.e., without per-
forming the tag comparison), the unnecessary way-accesses can be eliminated without cache-
access-time overhead. Thus, the way-prediction techniques introduced in Section 2.4.1.1 can
be used for reducing cache-access energy [35], [53]. A correct way-prediction makes it possible
to activate only the desired way without using the tag-comparison results. The detail of the
way-prediction techniques for low-energy dissipation is explained in Chapter 3.
As shown in Figure 2.1 in Section 2.3, conventional set-associative caches consist of several
memory-subarrays (i.e., ways). Albonesi [3] proposed the selective cache ways which allows
software to optimize the cache size and associativity. Each way can be enabled or disabled
by a Cache Way Select Register (CWSR). For example, a 32 KB four-way set-associative cache
can operate as an 8 KB direct-mapped cache, a 16 KB two-way set-associative cache, or a 32
KB four-way set-associative cache. Namely, the activated-area corresponds to the cache size
specified by the CWSR. Software such as the operating system can determine the trade-off
between performance and energy dissipation by modifying the CWSR.
(2) Omitting Tag Comparison
In conventional caches, a tag comparison is performed on every access to determine whether
the access hits the cache. Panwar et al. [70] proposed a conditional tag-comparison scheme
which attempts to reduce the total number of tag comparisons required during program execution.
If two successive instructions i and j reside in the same cache line, the tag comparison
for j can be omitted. Another approach to omitting the tag comparison is to exploit execution
footprints. The condition for performing the tag comparison is determined based on
the history of program execution [34]. The tag comparison for instruction j can be omitted
even if instructions i and j reside in different cache lines. The details of this technique are
discussed in Chapter 4.
2.5.2 Reducing Data-Transfer/Main-Memory-Access Energy
As introduced in Section 2.5.1.1(1), cache-access energy can be reduced by employing the
cache subbanking. The idea of subbanking comes from the fact that the referenced data is only one
word, not a whole cache line. As the offset of the referenced data within a cache line is
determined by the memory address, the selective activation of a subbank can be implemented.
However, this kind of technique cannot be employed directly for the main memory: since the
cache-line size is fixed, a DRAM-array access with a fixed cache-line size takes place on every
main-memory access. Main-memory subbanking can be achieved by reducing the size of the
data to be replaced between the cache and the main memory. Therefore, the techniques for low
memory traffic introduced in Section 2.4.3, the cache bypassing introduced in Section 2.4.2.2,
and the cache-line-size reduction introduced in Section 2.4.2.4 are useful for main-memory
subbanking. The details of the main-memory subbanking approach based on line-size optimization
are discussed in Chapter 5.
Another approach to reducing the energy consumed for data transfer is bus coding. Coded
data is transferred from the sender to the receiver instead of the raw data. The data to
be transferred is encoded in order to reduce the number of bus transitions, thereby saving
energy [84], [9]. Hardware components for encoding and decoding are required. Tomiyama et al.
[88] proposed a technique to reduce bus transitions based on instruction scheduling. Since the
number of bus transitions is reduced only by re-ordering instructions, their approach
does not require any hardware overhead (i.e., no encoder or decoder is required).
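One well-known transition-reducing code is bus-invert coding, sketched below for illustration; it is not necessarily the scheme used in [84] or [9]. If transmitting a word would toggle more than half of the bus lines, the complemented word is sent together with an extra invert signal.

# Sketch: bus-invert coding on an N-bit bus (illustrative; not necessarily
# the coding scheme of the cited papers).

BUS_WIDTH = 32

def bus_invert_encode(prev_word, word):
    """Return (value driven on the bus, invert-line value)."""
    mask = (1 << BUS_WIDTH) - 1
    toggles = bin((prev_word ^ word) & mask).count("1")
    if toggles > BUS_WIDTH // 2:
        return (~word) & mask, 1     # send complement, assert invert line
    return word & mask, 0            # send raw word

def bus_invert_decode(received, invert_bit):
    mask = (1 << BUS_WIDTH) - 1
    return (~received) & mask if invert_bit else received

sent, inv = bus_invert_encode(0x0000_0000, 0xFFFF_FF0F)
print(hex(bus_invert_decode(sent, inv)))  # 0xffffff0f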
2.5.3 Reducing DRAM-Static Energy
DRAM consumes not only the dynamic energy caused by DRAM accesses but also static
energy for refresh operations. The static energy is not included in Equation (2.4). However,
this energy consumption is also important for low-energy memory systems. One approach
to reducing the static energy consumption is to reduce the total number of DRAM refresh
operations performed.
Ohsawa et al. [67] proposed a selective refresh scheme to optimize the number of DRAM refreshes
required during program execution. The compiler analyzes data lifetimes, and then it inserts a
hint into store instructions indicating whether the stored data needs to be refreshed. Another
approach to reducing standby power is to exploit DRAM power states [58], [19].
2.6 Conclusions
In this chapter, we have surveyed the techniques for high speed, low energy memory systems.
The best way to improve performance/energy efficiency is to achieve fast and low-energy
access at each level of memory hierarchy and to concentrate memory accesses on the closest
level to the processor. In order to approach an ideal memory system, almost all techniques
introduced in this chapter rely on the locality of memory references: temporal locality and
spatial locality.
With respect to Equations (2.1) and (2.4), the cache architectures proposed in this thesis have
the following effects:
• The way-predicting set-associative cache introduced in Chapter 3 reduces the cache-
access energy (ECache in Equation (2.4)), and maintains the cache-miss rate of a con-
ventional organization with the same cache size and associativity. However, our cache
incurs an acceptable cache-access-time overhead due to wrong way-predictions. The
main-memory-access time and energy are also maintained. The way-predicting set-
associative cache belongs to the behavioral approach explained in Section 2.5.1.2.
• The history-based tag-comparison cache introduced in Chapter 4 reduces the cache-
access energy (ECache in Equation (2.4)) of direct-mapped instruction caches, and main-
tains the cache-miss rate of a conventional organization with the same cache size. The
cache-access time, main-memory-access time, and main-memory-access energy are also
maintained. The history-based tag-comparison cache belongs to the behavioral ap-
proach explained in Section 2.5.1.2.
• The dynamically variable-line size cache introduced in Chapter 5 reduces the cache-miss
rate (CMR in Equation (2.1)), and maintains the cache-access time of a conventional
organization with the same cache size and associativity. Our cache also reduces the
main-memory-access energy (EMainMemory in Equation (2.4)). In addition, exploiting
the high on-chip memory bandwidth of merged DRAM/logic LSIs reduces the main-
memory-access time (TMainMemory in Equation (2.1)). The dynamically variable-line size
cache belongs to the data-prefetching high-performance technique explained in Section
2.4.2.4, the high-performance technique using high memory bandwidth explained in
Section 2.4.3, and energy reduction techniques explained in Section 2.5.2. We can
regard the energy reduction technique of the dynamically variable-line size cache (i.e.,
optimizing cache-line size at run-time) as a behavioral approach.
Although many cache architectures for improving memory-system performance have been
proposed, one of the most important goals in future memory systems is to achieve high
performance and low energy dissipation at the same time. From the energy point of view, we have
introduced two approaches to reducing energy dissipation: the structural approach explained
in Section 2.5.1.1 and the behavioral approach explained in Section 2.5.1.2. The structural
approach attempts to reduce energy dissipation by improving the hierarchical memory organization
(the memory module is partitioned vertically and/or horizontally). We believe that it is
promising to develop innovative cache architectures based on the behavioral approach and
to combine them with the structural approach.
Chapter 3
Way-Predicting Set-Associative
Cache Architecture
3.1 Introduction
Many modern processors employ set-associative caches as L1 or L2 caches. Since an n-way
set-associative cache has n locations where a cache line can be placed, it can offer higher hit
rates than direct-mapped caches. However, increasing cache associativity makes the cache
access time longer due to the delay for way selection based on tag-comparison results. To
compensate for this disadvantage, several researchers have proposed way-predictable set-
associative caches [14],[16],[52],[99]. In fact, the way-prediction technique has been employed
in commercial processors [89],[98].
The cited papers have focused only on the performance improvement achieved by the way
prediction. However, we believe that way prediction can offer a significant energy reduc-
tion in set-associative caches. In this chapter, we propose a low power cache architecture
using the way prediction, called way-predicting set-associative cache. The way-predicting
set-associative cache speculatively selects one way, which is likely to contain the data desired
by the processor, from the set designated by the memory address, before it starts the normal
cache access. In conventional set-associative caches, all ways are accessed in parallel because
the cache-access time is critical. Since only one way has the referenced data, however, the other
way accesses are unnecessary. A correct way-prediction makes it possible to eliminate the
unnecessary way activation, so that the energy can be saved.
The rest of this chapter is organized as follows. Section 2 summarizes the energy con-
sumption and the cache-access time of a conventional set-associative cache. In addition, a
low-power set-associative cache is described as a counterpart to our architecture. Section 3
discusses the way-predicting set-associative cache in detail. Sections 4 and 5 evaluate the
way-predicting cache qualitatively and quantitatively in terms of both energy and
performance. Section 6 shows related work, and Section 7 gives some concluding remarks.
3.2 Set-Associative Cache
3.2.1 Conventional Caches
The cache-access energy depends on the energy dissipated for the SRAM access, as ex-
plained in section 2.2. In this chapter, we simplify the cache-access energy as follows:
ECache ≈ ESRAMarray (3.1)
= NTag × ETag + NData × EData (3.2)
• NTag, NData: The average numbers of tag-subarrays and data-subarrays, respectively, activated
for a cache access.
• ETag, EData: Energy dissipated for a tag-subarray access and that for a data-subarray
access, respectively.
In conventional set-associative caches, all the ways are activated regardless of hits or misses,
and the cache access can be completed in one cycle. Accordingly, average cache-access energy
(ECache) and time (TCache) of a conventional four-way set-associative cache (4SACache) can
be expressed by the following equations:
E4SACache = 4ETag + 4EData. (3.3)
T4SACache = 1Cycle. (3.4)
3.2.2 Phased Set-Associative Cache
Although at most only one way has the data desired by the processor, all the ways are
accessed in parallel, as shown in Figure 3.1(a). Thus, a lot of energy will be wasted in
conventional set-associative caches. To solve this issue, Hasegawa et al. proposed a low-power
set-associative cache architecture [26], which we refer to as the phased set-associative cache. As
shown in Figure 3.1(b), the phased set-associative cache divides the cache-access process into
the following two phases:
• Cycle 1: All the tags in the set indexed by the memory address are read out from tag-
subarrays in parallel. Then the tags are compared with the tag-portion in the memory
address for cache lookup. No data accesses occur during this phase.
• Cycle 2: If one of the tag-comparison results is a match, the matching way includes the
data desired by the processor. In this case, only the data-subarray in the matching way
is accessed. The remaining ways are not activated, so that the phased set-associative
cache can reduce the energy consumption. If no tag matches, the referenced
data does not reside in the cache. Accordingly, the cache access is terminated without
any data-subarray access, and a cache replacement is performed.
As explained above, the phased set-associative cache reduces the energy consumption by elim-
inating unnecessary way accesses. The phased four-way set-associative cache (P4SACache)
saves 3EData and 4EData, compared with the conventional four-way set-associative
cache (4SACache), on cache hits and cache misses, respectively. However, the cache suffers
from a longer cache-access time. There is no access-time penalty on cache misses, because
the data accesses are not performed; on cache hits, however, the two phases have to be
performed sequentially. The average energy consumption for a cache access (EP4SACache),
and the average cache-access time (TP4SACache), of the phased four-way set-associative cache
can be expressed as follows:
EP4SACache = 4ETag + CHR × EData. (3.5)
TP4SACache = 1Cycle + CHR × 1Cycle. (3.6)
Here, CHR is the cache hit rate.
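Equations (3.3) through (3.6) can be evaluated directly, as in the following Python sketch. The relative tag energy (ETag = 0.078 EData) is the assumption used later in Section 3.4; the printed values are in units of EData and cycles.

# Sketch of Equations (3.3)-(3.6): average energy (in units of E_Data) and
# access time (in cycles) for the conventional and phased 4-way caches.

E_TAG, E_DATA = 0.078, 1.0     # relative energies (assumption from Section 3.4)

def conventional_4way():
    return 4 * E_TAG + 4 * E_DATA, 1.0            # Eqs (3.3), (3.4)

def phased_4way(chr_):
    energy = 4 * E_TAG + chr_ * E_DATA            # Eq (3.5)
    time = 1.0 + chr_ * 1.0                       # Eq (3.6)
    return energy, time

print(conventional_4way())      # (4.312, 1.0)
print(phased_4way(chr_=0.95))   # (1.262, 1.95)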
Figure 3.1: Phased Set-Associative Cache. (a) conventional 4-way set-associative cache; (b) phased 4-way set-associative cache.
3.3 Way-Predicting Set-Associative Cache
3.3.1 Concept
The phased set-associative cache explained in Section 3.2.2 attempts to eliminate unnecessary
data-subarray accesses by accepting a cache-hit time penalty. For almost all programs, cache-hit
rates are very high, even for data caches, because of the high locality of memory references. As
the memory-system performance strongly affects the total program execution time, it is very
important to maintain fast cache accesses, especially on hits.
The way-predicting set-associative cache speculatively chooses one way before starting the
normal cache-access process. Then the cache divides the cache-access process into two phases,
similar to, but not the same as, the phased set-associative cache, as follows:
• Cycle 1: Both of a tag and a cache line from only the predicted-way are read out, and
then the tag comparison is performed. If the tag-comparison result is a match, the
data desired by the processor is provided from the cache line read out, and the cache
access is completed successfully. In this case, the way-predicting set-associative cache
behaves as a direct-mapped cache, as shown in Figure 3.2(a). If the tag-comparison
result is not a match, the second phase is performed.
• Cycle 2: The cache searches the remaining ways in parallel, as shown in Figure 3.2(b).
If one of the tag-comparison results is a match, the data from the hit way is provided
to the processor. Otherwise, a cache replacement takes place. Namely, the way-predicting
set-associative cache behaves as a “three-way” set-associative cache in this phase.
Figure 3.2: Way-Predicting Set-Associative Cache. (a) prediction-hit; (b) prediction-miss.
On a prediction-hit, as shown in Figure 3.2(a), the way-predicting set-associative cache
consumes energy only for activating the predicted way. In addition, the cache access can be
completed in one cycle. On prediction-misses (or cache misses), however, the cache-access
time increases due to the successive process of two phases as shown in Figure 3.2(b). Since all
the remaining ways are activated in the same manner as conventional set-associative caches,
the way-predicting set-associative cache cannot reduce energy consumption in this scenario.
The performance/energy efficiency of the way-predicting set-associative cache largely depends
on the accuracy of the way prediction.
The average energy consumption for an access (ECache), and the average cache-access
time (TCache), of the way-predicting four-way set-associative cache (WP4SACache) can be
expressed as follows:
EWP4SACache = (ETag + EData) + (1 − PHR) × (3ETag + 3EData) (3.7)
TWP4SACache = 1Cycle + (1 − PHR) × 1Cycle (3.8)
Here, PHR is the prediction-hit rate. The phased set-associative cache pays a cache-access-time
penalty on cache hits, while the way-predicting set-associative cache pays it on prediction-misses
(including cache misses). The total number of cache hits is much larger than that of cache misses, so that the
way-predicting set-associative cache has a significant advantage in cache-access time,
compared with the phased set-associative cache.
3.3.2 Way Prediction
Many application programs have high locality of memory references. This means that a
cache line referenced by the processor will be referenced again in the near future.
Here, it is assumed that seti is accessed by the processor for cache look-up, and wayj
(0 ≤ j ≤ AS − 1, where AS is the cache associativity) causes a cache hit. In this case, the
data required by the processor will likely reside in wayj on a near-future access to seti if the
program has high locality of memory references. Accordingly, we have decided to employ
a way-prediction policy based on the MRU (Most Recently Used) algorithm. The way predictor
determines the predicted way for the set being accessed by the processor as follows:
• On prediction-hits, the way predictor does not do anything because the current way-
prediction is correct.
• On prediction-misses (but cache hits), the way predictor regards the way having the
data desired by the processor as the predicted way. The predicted way can be deter-
mined by tag comparison results.
• On cache-misses, the way predictor regards the way to be filled on cache replacement
as the predicted way. The predicted way can be determined by the results of tag
comparisons (hit or miss) and the status flags indicating which way is to be replaced.
Figure 3.3: Organization of Way-Predicting Four-Way Set-Associative Cache.
3.3.3 Organization
Figure 3.3 gives an organization of the way-predicting four-way set-associative cache (WP4SACache).
Compared to the conventional four-way set-associative cache (4SACache), only the following
additional components are required:
• Way-prediction table, which contains a two-bit flag (way-prediction flag) for each set.
The two-bit flag is used to speculatively choose one way from the corresponding set.
• Way predictor, which determines the value of each way-prediction flag according to the
MRU (most-recently used) algorithm explained in Section 3.3.2.
The way-predicting four-way set-associative cache (WP4SACache) works as follows:
1. The way-prediction flag associated with a given set is accessed, and is read from the
way-prediction table immediately after an effective memory address is generated. The
predicted way is determined by the way-prediction flag read out.
2. When a memory access takes place, the WP4SACache starts to decode the memory
address in the same manner as conventional set-associative caches.
3. Only the predicted way is activated, and the tag and the cache-line associated with the
predicted way are read simultaneously. The tag is then compared with the tag-portion
of the memory address. If the tag-comparison result is a match (prediction-hit), the
cache access completes successfully. Otherwise, steps 4 and 5 are performed.
4. The remaining three ways are activated, and all the tags and the cache-lines are read
out in parallel. Then, the three tags are compared with the tag-portion of the memory
address. If one of the three tags matches (prediction-miss), the WP4SACache provides the
referenced data to the processor. Otherwise (cache-miss), a cache replacement takes
place.
5. The way predictor modifies the way-prediction flag based on the results of the tag com-
parison or the status flags as explained in Section 3.3.2. The modified way-prediction
flag is written back to the way-prediction table. Note that this write-back operation
does not cause any cache-access-time penalty, because it can be performed during the data
transfer from the cache to the processor or during the cache-replacement process. (A behavioral sketch of this procedure is given below.)
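A behavioral sketch of the access procedure (steps 1 to 5) follows. It models only hits, prediction-misses, and misses; the replacement victim is chosen at random here purely for brevity, whereas the real cache uses the LRU status flags as explained in Section 3.3.2.

# Behavioral sketch of the WP4SACache lookup (steps 1-5 above).
# Victim selection is simplified to random choice; real hardware uses LRU flags.

import random

class WPSetAssociativeCache:
    def __init__(self, num_sets=128, ways=4):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.predict = [0] * num_sets            # way-prediction table (MRU way per set)

    def access(self, set_index, tag):
        pred = self.predict[set_index]           # step 1: read the way-prediction flag
        if self.tags[set_index][pred] == tag:    # step 3: probe only the predicted way
            return "prediction-hit", 1           # 1 cycle, 1 way activated
        for way in range(self.ways):             # step 4: probe the remaining ways
            if way != pred and self.tags[set_index][way] == tag:
                self.predict[set_index] = way    # step 5: MRU update on prediction-miss
                return "prediction-miss", 2      # 2 cycles, all ways activated
        victim = random.randrange(self.ways)     # cache miss: fill a victim way
        self.tags[set_index][victim] = tag
        self.predict[set_index] = victim         # step 5: predict the filled way
        return "cache-miss", 2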
3.4 Evaluations: Theoretical Analysis
To clarify the upper and lower bounds of the performance/energy improvements achieved
by the way-predicting set-associative cache architecture, we performed a qualitative analysis
using the energy and performance equations defined in the previous sections. Figure 3.4
shows the average energy consumption and the average cache-access time based on equations
from (3.3) to (3.8) for:
• a conventional four-way set-associative cache (4SACache),
• a phased four-way set-associative cache (P4SACache), and
• a way-predicting four-way set-associative cache (WP4SACache).
For every cache, the cache size, cache-line size, and associativity (the number of ways) are
16 K bytes, 32 bytes, and 4, respectively. Because the same replacement algorithm (usually
LRU) is used for every cache, the cache-hit rate (CHR) is common to all the caches.
Figure 3.4: Average Energy Consumption per Cache Access and Average Cache-Access Time. (a) energy and cache-access time as functions of the prediction-hit rate and the cache-miss rate; (b) average energy consumption and average cache-access time when CMR = 5%.
Figure 3.4(a) plots the average energy consumption per cache access, and the cache-access
time, as a function of the prediction-hit rate (PHR) and the cache-miss rate (CMR =
1 − CHR) for each cache (4SACache, P4SACache, and WP4SACache). The following as-
sumptions are made:
• The address size, set-index size, and byte-offset size are 32 bits, 7 bits (= log2 128), and
5 bits (= log2 32), respectively. Thus the tag size is 20 bits.
• ETag = 0.078EData because the ratio of the tag size to the cache-line size is 20 : 256 (in
terms of bits), or 0.078 : 1.
We can see the following two results from Figure 3.4(a). First, the way-predicting cache
performs best when PHR = 100% (i.e., CHR = 100%). In this case, the average energy
consumption is reduced by 75% without any cache-access-time degradation, compared to the
conventional set-associative cache. On the other hand, although the phased cache also achieves
about a 75% energy reduction, the cache-access time becomes twice as long due to the
sequential accesses on cache hits. Second, the way-predicting cache performs worst, even
if CHR = 100%, when PHR = 0%. Compared to the conventional set-associative cache,
the average cache-access time increases by 100% while the average energy consumption is
unchanged. Though the cache-access time is the same, the average energy consumption is
greater by 229%, compared to the phased cache.
Figure 3.4(b) is a cross section of Figure 3.4(a) at CMR = 5%, and plots the
average energy consumption and the cache-access time as functions of the prediction-hit rate
(PHR). From Figure 3.4(b), the following observations are made (they can be reproduced
directly from the equations, as sketched after the list):
• Phased cache vs. conventional cache: Compared to the conventional cache, the phased
cache can reduce the average energy consumption by 71%, but it increases the average
cache-access time by 95%.
• Way-predicting cache vs. conventional cache: If the way-predicting cache achieves a
95% prediction-hit rate, the average energy consumption can be reduced by 71% while
maintaining performance comparable to that of the conventional cache. When PHR is
80%, the way-predicting cache can achieve a 60% energy reduction with a 20% average
cache-access-time overhead.
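These percentages follow directly from Equations (3.3) through (3.8) with ETag = 0.078 EData and CMR = 5%; the short sketch below reproduces them.

# Reproducing the numbers in Section 3.4 from Equations (3.3)-(3.8),
# with E_Tag = 0.078 E_Data and CMR = 5% (CHR = 95%).

E_TAG, E_DATA, CHR = 0.078, 1.0, 0.95

e_conv, t_conv = 4 * E_TAG + 4 * E_DATA, 1.0                  # Eqs (3.3), (3.4)
e_phased, t_phased = 4 * E_TAG + CHR * E_DATA, 1.0 + CHR      # Eqs (3.5), (3.6)

def wp(phr):                                                  # Eqs (3.7), (3.8)
    e = (E_TAG + E_DATA) * (1 + 3 * (1 - phr))
    return e, 1.0 + (1 - phr)

print(1 - e_phased / e_conv, t_phased / t_conv - 1)   # ~0.71 reduction, +0.95 time
for phr in (0.95, 0.80):
    e, t = wp(phr)
    print(phr, 1 - e / e_conv, t / t_conv - 1)        # ~0.71/+0.05 and ~0.60/+0.20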
3.5 Evaluations: Experimental Analysis
3.5.1 Simulation Environment
To evaluate the effectiveness of the way-predicting set-associative cache architecture on real
workloads, we performed a quantitative analysis using benchmark programs. We carried out
experiments using a cache simulator. The cache simulator takes an address trace generated by
QPT [31] as its input, and simulates the LRU cache-replacement algorithm and the MRU way-
prediction algorithm. The simulator then reports the prediction-hit rate (PHR),
the prediction-miss rate (PMR), and the cache-miss rate (CMR) as its outputs. All benchmark
programs were compiled by GNU CC (-O2) for the UltraSPARC. We used the programs
listed in Table 3.1 from the SPEC95 benchmark suite [82].
Table 3.1: Benchmark Programs.
SPECint95 (input: training): 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
SPECfp95 (input: test): 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d
3.5.2 Way-Prediction Hit Rates
Table 3.2 shows the benchmark results. As can be seen from the table, I-caches of all the
programs achieve quite high prediction-hit rates (PHR) of over 90%, and the average PHR
is about 96%. These results can be understood by considering the behavior of instruction references:
program execution basically proceeds through successive instructions, and a cache line
contains several instructions to exploit the spatial locality of references, so the MRU
way-prediction algorithm works very well. For the D-caches, more than half of the
programs also achieve high prediction-hit rates (PHR) of over 90%, and the average PHR
is about 86%, which is lower than that of the I-caches. Data references also have spatial
locality, but it is not as high as that of instruction references. Since the data-reference behavior
depends on program characteristics, the accuracy of the MRU-based way prediction is highly
application-dependent.
Table 3.2: Benchmark Results: Prediction-Hit Rate (PHR), Prediction-Miss Rate (PMR),
and Cache-Miss Rate (CMR)
Benchmarks I-Cache D-Cache
PHR(%) PMR(%) CMR(%) PHR(%) PMR(%) CMR(%)
099.go 94.55 4.04 1.41 81.31 17.45 1.24
124.m88ksim 95.76 4.05 0.19 95.47 3.63 0.91
126.gcc 92.32 5.09 2.59 87.40 9.59 3.01
129.compress 99.98 0.02 0.00 91.64 3.63 4.73
130.li 97.28 2.71 0.00 92.82 3.91 3.27
132.ijpeg 99.74 0.25 0.01 92.60 6.38 1.02
134.perl 94.93 4.65 0.42 92.64 5.78 1.58
147.vortex 91.65 7.11 1.25 89.38 9.16 1.46
101.tomcatv 91.61 7.30 1.09 87.96 9.96 2.08
102.swim 97.96 2.04 0.00 50.27 31.71 18.03
103.su2cor 96.48 3.23 0.28 85.22 8.14 6.64
104.hydro2d 98.28 1.43 0.29 89.41 3.55 7.04
Average 95.87 3.49 0.62 86.34 9.41 4.25
Among all the benchmark results, the PHR of D-cache for 102.swim is the lowest at 50%,
and the cache-miss rate (CMR) is also the highest at 18%. If we consider that the CMR
of a direct-mapped cache with the same cache size and cache-line size for 102.swim is about
13.8%, it seems that the LRU cache-replacement algorithm and the MRU way-prediction
algorithm do not match the memory-reference pattern of 102.swim.
3.5.3 Cache-Access Time and Energy
Based on the models of energy consumption and cache-access time expressed by Equations
(3.3) to (3.8), and on the benchmark results reported in Section 3.5.2, Figures 3.5 and 3.6
show the average energy consumption per cache access and the average cache-access time of
the I-cache and the D-cache, respectively. All results for each program are
normalized to the conventional four-way set-associative cache (4SACache).
Figure 3.5: Average Energy Consumption and Average Cache-Access Time for I-Cache.
For many programs, the way-predicting I-cache produces better results than the phased I-
cache. The way-predicting cache activates only one tag-subarray on prediction-hits, while the
phased cache activates all tag-subarrays regardless of cache-hits or cache-misses. Accordingly,
the way-predicting cache can save more energy than the phased cache when
the prediction-hit rate is high.
Figure 3.6: Average Energy Consumption and Average Cache-Access Time for D-Cache.
Compared to the conventional cache, for most of the programs, the phased cache reduces
the average energy consumption by about 70%, but it increases the average cache-access
time by about 100%. On the other hand, the way-predicting cache achieves almost the same
energy reduction as the phased cache without a significant performance drawback.
Figure 3.7: Total I-Cache Energy Dissipated for the Execution of Programs.
Figure 3.7 and Figure 3.8 show the total energy consumed in the caches during the execution
of each program, i.e., the total number of memory references × the average cache-access energy
(ECache), where the energy dissipated for a data-subarray access (EData) is 2.5 nanojoules.
EData is calculated based on Kamble's model [48], and the 0.8-micron CMOS cache design
described in [95] is assumed. In order to obtain the value of EData using Kamble's
model, the following parameters are assumed:
• Total number of rows, or Nrow, is 128 (16 KB / (32 B × 4 ways) = 128 sets).
• Tag bits, or T , is 0 because EData does not include the energy dissipated in tag-
subarrays.
• Associativity, or M , is 1 because EData is the energy dissipated in one data-subarray
(not in all data-subarrays).
A more detailed explanation of the calculation is presented in Section 5.6.3.
Figure 3.8: Total D-Cache Energy Dissipated for the Execution of Programs.
3.5.4 Energy-Delay Product
To evaluate both energy and performance at the same time, we measured the energy-delay
(ED) product for each cache. Figure 3.9 shows the mean ED product (= average energy
consumption per cache access × average cache-access time) over all of the benchmark programs.
Again, these values are normalized to those of the conventional four-way set-associative cache
(4SACache). The I-Cache&D-Cache in the figure shows the average ED product per instruc-
tion execution. Figure 3.9 indicates that the way-predicting cache improves the mean ED
product by about 70% and 60% for the I-cache and the D-cache, respectively, compared with
the conventional set-associative cache.
Here, it is assumed from simulation results that the average number of instruction-memory
accesses and that of data-memory accesses per instruction execution are 1 and 0.278, respec-
tively. When we consider the instruction cache and the data cache together, the way-predicting
cache (WP4SACache) produces about a 70% ED-product reduction, while the phased cache
(P4SACache) reduces it by only about 40%, compared with the conventional cache (4SACache).
Figure 3.9: ED Product.
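The combined I-cache & D-cache figure is obtained by weighting the per-cache ED products with the access frequencies quoted above (1 instruction access and 0.278 data accesses per executed instruction). The sketch below shows that calculation; the example per-access ED values passed in are placeholders, not measured results.

# Sketch: combined ED product per executed instruction, weighting the
# I-cache and D-cache by their access frequencies.  The example per-access
# ED values are placeholders, not measured results.

I_ACCESSES_PER_INSN = 1.0
D_ACCESSES_PER_INSN = 0.278

def combined_ed_per_instruction(ed_icache, ed_dcache):
    """ed_icache / ed_dcache: energy x delay per access of each cache."""
    return (I_ACCESSES_PER_INSN * ed_icache +
            D_ACCESSES_PER_INSN * ed_dcache)

# Example: normalize a candidate cache against the conventional one.
ratio = (combined_ed_per_instruction(0.3, 0.4) /
         combined_ed_per_instruction(1.0, 1.0))
print(ratio)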
3.5.5 Performance/Energy Overhead
Thus far, we have ignored the energy and performance overhead caused by the way prediction. In reality,
this overhead needs to be taken into account, as follows.
• Energy-consumption overhead: Activating the way predictor and accessing the
way-prediction table dissipate extra energy.
• Cache-access-time overhead: The delay for reading the way-prediction flag will directly
increase the cache-access time, because the read must be completed before the normal
cache access starts.
First, we consider the energy overhead. The way-predictor can be implemented with a small
combinational logic, because the MRU algorithm is very simple.
Figure 3.10: The Effect of Cache Access-Time Overhead on ED Products.
Moreover, the way-predictor
predicts and writes back the modified way-prediction flag into the way-prediction table when
incorrect way-predictions take place. Since the prediction-hit rates of many programs are
very high, as shown in Section 3.5.2, this energy overhead can be ignored. For every cache
access, the two-bit way-prediction flag is read from the way-prediction table. This process
may also increase the energy overhead. When the way-prediction table is implemented
with flip-flops, the energy dissipation depends on the switching activity of the way-prediction
flags being read. In our simulations, we observed that the average number of switching bits in
the two-bit way-prediction flag per access is 0.4 (I-cache) and 0.8 (D-cache). This energy
consumption overhead is insignificant.
Next, we discuss the cache access-time overhead. Here, we re-define the average cache
access-time of the way-predicting set-associative cache to include the way-prediction-table
access overhead as follows:
TWP4SACache = 1Cycle + (1 − PHR) × 1Cycle + Toverhead × 1Cycle (3.9)
Here, Toverhead is the average way-prediction-table access-time. Figure 3.10 shows the average
ED product of twelve benchmark programs for the conventional four-way set-associative cache
(4SACache), the phased four-way set-associative cache (P4SACache), and the way-predicting
four-way set-associative cache (WP4SACache) with access-time overhead Toverhead. For the
instruction cache, the way-predicting cache produces about a 42% ED-product improvement
over the conventional cache, which is better than the phased cache, even if the cache has
a 100% access-time overhead (Toverhead = 1.0 cycle). For the data cache, the effectiveness of the
way-predicting cache with a 50% access-time overhead (Toverhead = 0.5 cycle) is almost the same as
that of the phased cache. In addition, when the access-time overhead is 100% (Toverhead = 1.0
cycle), the way-predicting cache can still achieve about a 25% ED-product reduction over the
conventional cache.
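Equation (3.9) and the resulting normalized ED product can be evaluated as in the sketch below. The prediction-hit rate and the relative tag energy (ETag = 0.078 EData) are assumptions for illustration; the actual results in Figure 3.10 are averaged over the twelve benchmark programs.

# Sketch of Equation (3.9): access time of the way-predicting cache including
# the way-prediction-table overhead, and the ED product normalized to the
# conventional 4-way cache.  PHR and the tag energy ratio are assumptions.

E_TAG, E_DATA = 0.078, 1.0

def normalized_ed_wp(phr, t_overhead):
    e_wp = (E_TAG + E_DATA) * (1 + 3 * (1 - phr))      # Eq (3.7)
    t_wp = 1.0 + (1 - phr) + t_overhead                # Eq (3.9)
    e_conv, t_conv = 4 * (E_TAG + E_DATA), 1.0         # Eqs (3.3), (3.4)
    return (e_wp * t_wp) / (e_conv * t_conv)

for t_ov in (0.0, 0.5, 1.0):
    print(t_ov, normalized_ed_wp(phr=0.95, t_overhead=t_ov))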
Actually, there are some methods to solve the access-time overhead problem. For example,
the delay of the way-prediction-table access can be reduced by an implementation choice, using
flip-flops rather than an SRAM array. In addition, the access-time overhead can be hidden by
calculating the cache-index address at an earlier pipeline stage [14].
3.5.6 Effects of Other Parameters
The effectiveness of the way-predicting set-associative cache depends on the prediction-hit
rates. In this section, we evaluate the effects of the hardware constraints: cache size, cache-line
size, associativity, and way-prediction-table size. We use three benchmarks for I-caches
(129.compress, 147.vortex, and 101.tomcatv) and four benchmarks for D-caches (099.go, 124.m88ksim,
102.swim, and 104.hydro2d). The prediction-hit rates of 129.compress on the I-cache and 124.m88ksim
on the D-cache are the best of all programs; the other programs used in this section produce
lower prediction-hit rates, as shown in Section 3.5.2. For all figures in this section, solid lines
and broken lines show the calculation results for the way-predicting four-way set-associative
cache (WP4SACache) and the phased four-way set-associative cache (P4SACache), re-
spectively. All results for each parameter are normalized to the conventional four-way set-
associative cache (4SACache); the figures therefore show how much the caches improve
or degrade energy and performance relative to the conventional cache. Unless stated otherwise,
we assume that the cache size, the cache-line size, and the associativity are 16 KB, 16 B,
and 4, respectively.
3.5.6.1 Cache Size
We measured the average energy consumption per cache access (ECache), and the average
cache-access time (TCache), with various cache sizes. Figure 3.11 shows simulation results.
Figure 3.11: Effects of Cache Size. (Normalized ECache and TCache versus cache size, 2 KB to 128 KB, with a 16-byte line size; panel (A) instruction caches, panel (B) data caches; curves for P4SACache and WP4SACache on the selected benchmarks.)
As can be seen from the figure, increasing the cache size reduces the energy consumption of
the way-predicting set-associative cache, while there is no significant difference for the phased
cache. The cache-miss rate affects the accuracy of the way prediction, because a cache miss
means that the desired data does not reside in the cache, so it is impossible to make a correct
prediction on a cache miss. In other words, increasing the cache size improves the cache-hit
rate and, as a result, the prediction-hit rate is also improved. The phased cache sacrifices the
cache-hit time, while the way-predicting cache sacrifices the prediction-miss time.
Table 3.3: Energy and Access Time with a 128-Byte Cache-Line Size.
Benchmarks I-Cache D-Cache
P4SACache WP4SACache P4SACache WP4SACache
TCache ECache TCache ECache TCache ECache TCache ECache
099.go 199.5% 26.3% 102.1% 26.6% 197.0% 25.7% 130.1% 47.6%
124.m88ksim 199.8% 26.4% 102.7% 27.0% 199.7% 26.4% 107.5% 30.6%
126.gcc 198.1% 26.0% 104.3% 28.2% 197.5% 25.8% 115.6% 36.7%
129.compress 200.0% 26.4% 100.0% 25.1% 196.6% 25.6% 109.9% 32.4%
130.li 200.0% 26.4% 101.4% 26.1% 198.2% 26.0% 108.2% 31.1%
132.ijpeg 200.0% 26.4% 100.6% 25.4% 199.6% 26.3% 112.1% 34.1%
134.perl 199.5% 26.3% 103.5% 27.7% 199.0% 26.2% 111.9% 33.9%
147.vortex 199.3% 26.3% 104.4% 28.3% 198.1% 26.0% 117.3% 37.9%
101.tomcatv 198.0% 25.9% 103.9% 28.0% 199.4% 26.2% 119.4% 39.5%
102.swim 200.0% 26.4% 101.0% 25.8% 168.6% 18.7% 152.4% 64.3%
103.su2cor 199.6% 26.3% 101.7% 26.3% 197.6% 25.8% 134.3% 50.7%
104.hydro2d 199.6% 26.4% 100.8% 25.6% 198.2% 26.0% 113.2% 34.9%
Therefore, the performance gap between the way-predicting cache and the phased
cache becomes larger and larger as the cache size increases.
3.5.6.2 Cache-Line Size
As explained in Chapter 1, we can exploit the high on-chip memory bandwidth of the
merged DRAM/logic LSI, which will be one of the core devices in future system LSIs. The
high bandwidth makes it possible to increase the cache-line size without increasing the cache-
replacement penalty. Therefore, it is important to evaluate the way-predicting set-associative
cache architecture with larger cache-line sizes.
To find the effects of the cache-line size on the energy and performance of the way-
predicting cache, we measured the prediction-hit rates of caches with various cache-line
sizes, and then calculated the average energy consumption (ECache) and the average cache-
access time (TCache). Figure 3.12 shows the calculation results.
Figure 3.12: Effects of Cache-Line Size. (Normalized ECache and TCache versus cache-line size, 16 to 256 bytes; panel (A) instruction caches, panel (B) data caches; curves for P4SACache and WP4SACache on the selected benchmarks.)
The incremental instruction accesses within the large cache lines improve the MRU-based
prediction-hit rate. Therefore, the energy gap between the way-predicting cache and the
phased cache becomes smaller and smaller as the cache-line size increases. On the other hand,
the best cache-line size for the D-cache is highly application dependent [36]. Accordingly, the
energy reduction achieved by the way-predicting cache also depends on the characteristics
of the programs. For instance, the energy consumption for 099.go increases with the
cache-line size. For 104.hydro2d, the energy is reduced up to a 64-byte cache-line size, and then
it starts to increase. For almost all programs, it is observed that the energy consumption
increases for very large cache-line sizes. This comes from the following two
reasons. First, much larger cache-line sizes worsen the cache-hit rates when the program
has poor spatial locality; as explained in Section 3.5.6.1, lower cache-hit rates bring lower
prediction-hit rates. Second, increasing the cache-line size reduces the total number of sets
in the cache. A number of memory references then share a set, so the way accesses in the
set can be distributed, and these memory references also have to share a way-prediction flag.
As a result, the accuracy of the way prediction is degraded. However, there is still a large
performance gap between the way-predicting cache and the phased cache.
Table 3.3 shows the average energy consumption per cache access and the average cache-
access time for all benchmark programs when the caches have a 128-byte cache-line size. The
way-predicting cache reduces a large amount of energy and also maintains fast cache accesses,
as with the small 32-byte cache-line size reported in Section 3.5.3. The advantage of
way-predicting I-caches is very clear. For all programs except 099.go, 102.swim, and
103.su2cor, the difference in energy-reduction rates between the way-predicting cache and
the phased cache is less than 15%, while the performance difference is much larger. For
caches with a large cache-line size, we can summarize the above results as follows.
If we do not care about the performance degradation, the phased cache should be employed;
otherwise, it is better to employ the way-predicting cache. In particular, the way-predicting
cache produces significant performance/energy improvements when the cache-line size is equal
to or smaller than 128 bytes.
3.5.6.3 Cache Associativity
We measured the average energy consumption per cache access (ECache) and the average
cache-access time (TCache) with various cache associativities. Figure 3.13 shows the simulation
results.
For both I-caches and D-caches, the amount of energy reduction decreases as the cache
associativity increases, because the energy consumed for activating the tag subarrays becomes
relatively larger. For example, the tag width is 25 bits, the index width is 3 bits (2^3 sets),
and the offset width is 4 bits (2^4 bytes) when the associativity is 128, the cache size is 16 KB,
and the cache-line size is 16 bytes.
Figure 3.13: Effects of Associativity. (Normalized ECache and TCache versus associativity, 2 to 128 ways, with a 16-byte line size; panel (A) instruction caches, panel (B) data caches.)
Accordingly, 3,200 bits (25 bits × 128 ways) are activated
for the tag comparison, while only 128 bits (one 16-byte cache line) are activated for the cache-line
access.
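The bit counts in this example follow directly from the address breakdown; the small helper below (assuming a 32-bit address) reproduces them for any cache configuration.

import math

def activated_bits(cache_bytes, line_bytes, ways, addr_bits=32):
    """Bits activated per access: tags of all ways versus one cache line of data."""
    sets = cache_bytes // (line_bytes * ways)
    tag_bits = addr_bits - int(math.log2(sets)) - int(math.log2(line_bytes))
    return tag_bits * ways, line_bytes * 8

# 16 KB cache, 16-byte lines, 128 ways -> (3200, 128), as in the example above.
print(activated_bits(16 * 1024, 16, 128))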
On the other hand, the way-predicting cache produces significant energy reductions up
to an associativity of 16. Increasing the cache associativity helps to reduce the energy consumption
due to the subbanking effect. However, a higher-associativity cache has more candidates for the
predicted way, which might worsen the accuracy of the way prediction.
Figure 3.14: Effects of Way-Prediction Table Size. (Normalized ECache and TCache versus the number of sets sharing a way-prediction flag, from 1 (each) to 256 (all); panel (A) instruction caches, panel (B) data caches.)
When the loss in way-prediction accuracy outweighs the subbanking effect, the energy efficiency of
the way-predicting cache is reduced. Accordingly, when a highly associative cache is employed,
the phased cache should be chosen; otherwise, the way-predicting cache achieves significant
energy reductions and should be employed.
3.5.6.4 Way-Prediction Table Size
We measured the average energy consumption per cache access (ECache) and the average
cache-access time (TCache) with various way-prediction-table sizes. Figure 3.14 shows the simulation
results. The x-axis is the number of sets sharing a way-prediction flag. The rightmost
point, denoted 256 (all), is the result when the cache has only one way-prediction flag, i.e.,
the total number of way-prediction flags is one. Decreasing the total number of way-prediction
flags alleviates the performance/energy overhead discussed in Section 3.5.5.
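One plausible way to realize this sharing is to drop low-order bits of the set index when the way-prediction table is accessed, so that a group of consecutive sets maps to the same flag. The indexing below is an illustrative assumption, not a detail taken from the evaluated design.

def prediction_table_entry(set_index, sets_per_flag):
    """Map a cache-set index to its (possibly shared) way-prediction-table entry."""
    return set_index // sets_per_flag

# 256 sets: sharing degrees 1, 4, 16, 64, 256 leave 256, 64, 16, 4, 1 flags in total.
for sharing in (1, 4, 16, 64, 256):
    print(sharing, 256 // sharing, prediction_table_entry(200, sharing))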
For I-caches, the energy-reduction rate decreases when the number of sets sharing a way-
prediction flag is changed from 1 to 4; after that, the cache maintains its energy
efficiency. For D-caches, on the other hand, the energy efficiency degrades in proportion to the increase
in the number of sharing sets. The same behavior can be seen for the cache-access time. From
these results, we conclude that sharing the way-prediction flag is a good way to alleviate the
performance/energy overhead of accessing the way-prediction table for I-caches, but not for
D-caches.
3.6 Related Work
There have been several proposals for reducing the power consumption of on-chip caches.
MDM (Multiple-Divided Module) cache [56] attempts to reduce the power consumption by
means of partitioning the cache into several small sub-caches. MDM cache requires a great
amount of hardware modification. Block buffering [48], [85], filter cache [54], and L-cache
[25] achieve low power consumption by adding a very small L0-cache between the processor
and the L1-cache. The advantage of L0-cache approaches decreases when memory reference
locality is low and cache replacement happens frequently between the L0 and L1 caches.
Hasegawa et al. [26] proposed a low-power set-associative cache architecture, which we have
compared with our way-predicting cache. Their cache is referred to as the phased cache in this
chapter, and it suffers from a longer cache-hit time.
On the other hand, the way-predicting set-associative cache architecture can be imple-
mented with small hardware overhead, because the cache structure and memory hierarchy
of the conventional memory system are maintained. Assuming the same associativity, the cache-
miss rate of the way-predicting set-associative cache is the same as that of the conventional
set-associative cache. The way-predicting set-associative cache can offer significant energy
reduction without large performance degradation when the way-prediction-hit rate is high.
3.7 Conclusions
In this chapter, the way-predicting set-associative cache for low energy consumption has been
proposed. The way-predicting cache speculatively selects one way from the set designated by
a memory address before beginning a normal cache access. By accessing only the predicted
way, instead of all the ways, the energy consumption can be reduced.
For the way-predicting cache to perform well, the accuracy of way prediction is important.
The experimental results show that the accuracy of the MRU-based way prediction is higher
than 90% for most of the benchmark programs. It is also observed that the way-predicting
cache improves the ED product by 60–70% over the conventional set-associative cache.
To implement the MRU-based way prediction, the way-prediction table has to be added to
the conventional cache organization. In particular, the performance penalty caused by access-
ing the way-prediction table cannot be ignored. We have evaluated the performance/energy
efficiency of the way-predicting set-associative cache including this performance penalty, and we
have also discussed some approaches to solving this performance problem.
In addition, we have evaluated the effects of other parameters on the improvement achieved
by the way-predicting set-associative cache: cache size, cache-line size, associativity, and
way-prediction-table size. It is observed that increasing the cache size produces better
results. The trend is to increase the cache size so that more memory accesses are confined on-chip for
high performance and low power consumption. Therefore, we believe that the way-
predicting set-associative cache architecture is usable for future LSIs. Moreover, we have
shown that decreasing the size of the way-prediction table is a promising way to alleviate the
performance/energy overhead of accessing the way-prediction table.
There are many alternatives for the way prediction, such as hash-rehash caches and column-
associative caches [52], [1], [14], which have been proposed for performance reasons.
Kim et al. [53] reported that the MRU approach consumes the least energy. In this chapter,
we have assumed that the cache-access time of the way-predicting cache on a prediction hit is the same as
the cache-hit time of a conventional set-associative cache. In fact, the cache-access time
on a prediction hit can be faster, as in the hash-rehash and column-associative caches. No
matter which way-prediction algorithm is employed, this kind of behavioral approach is very
promising for future high-performance/low-power system LSIs.
Chapter 4
History-Based Tag-Comparison
Cache Architecture
4.1 Introduction
On-chip caches have been playing an important role in achieving high-performance processors.
In particular, much higher performance is required for instruction caches because one or more
instructions have to be issued on every clock cycle. From the energy point of view, this also means that
the instruction cache consumes a lot of energy. Therefore, it is strongly required to reduce
the energy consumption of instruction-cache accesses.
In cache accesses, the tag indexed by the memory address is read from tag memory. Then
the tag is compared with the tag-portion in the memory address to determine whether the
entry in the cache corresponds to the requested address. If the tag is equal to the tag-portion,
then the access hits in the cache. Otherwise, a cache miss occurs. Therefore, the energy for
the tag comparison and that for the data access (data read or write) are consumed on every
cache access.
In this chapter, we focus on the energy consumed for the tag comparison, and propose
a novel architecture for low-power direct-mapped instruction caches, called “history-based
tag-comparison cache”. The cache predicts the residence of instructions to be fetched before
the tag comparison is performed. If the prediction is correct, the tag comparison can be
omitted. In this case, the cache does not need to waste the energy for the tag comparison.
Our method guarantees completely correct predictions.
The rest of this chapter is organized as follows. Section 2 shows the effect of tag compar-
ison on the total cache energy. In addition, another technique to omit the tag comparison
proposed in [70] is explained as a comparative method. Section 3 presents the concept and
mechanism of our history-based tag-comparison cache. Section 4 reports evaluation results
for the energy efficiency of the history-based tag-comparison cache. Moreover, the effects of
hardware constraints are analyzed. Section 5 shows related work, and Section 6 concludes this
chapter.
4.2 Breakdown of Cache-Access Energy
In direct-mapped instruction caches, tag comparison and data read are performed in parallel.
Thus, the total energy consumed for a cache access has two factors: the energy for the tag
comparison and that for the data read. Here, we assume that the logic portion, comparators
for the tag comparison and multiplexors for the data read, does not dissipate any energy.
Therefore, we need to consider the energy for tag-memory accesses and data-memory accesses.
In conventional caches, the tag memory and the data memory have the same height but
different widths, because the memory width depends on the tag size and the cache-line size,
and the tag size is usually much smaller than the cache-line size. For example,
for a 16 KB direct-mapped cache with 32-byte lines, the cache-line size is 256 bits
(32 × 8), while the tag size is 18 bits (32-bit address − 9 index bits − 5 offset bits). Thus, the total
cache energy is dominated by data-memory accesses.
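The 18-bit versus 256-bit breakdown above can be checked with a few lines of arithmetic (the 32-bit address width is the assumption used throughout this example):

import math

cache_bytes, line_bytes, addr_bits = 16 * 1024, 32, 32
index_bits = int(math.log2(cache_bytes // line_bytes))    # 9 (512 sets)
offset_bits = int(math.log2(line_bytes))                  # 5
tag_bits = addr_bits - index_bits - offset_bits           # 18
line_bits = line_bytes * 8                                 # 256
print(tag_bits, line_bits, round(tag_bits / (tag_bits + line_bits), 3))  # tags are ~7% of the bits read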
Cache subbanking is one approach to reducing the data-access energy. The data-memory
array is partitioned into several subbanks, and only the subbank that includes the desired
data is activated [85]. Since the bank address can be derived from the memory address, the cache
has no access-time overhead. Figure 4.1 depicts the breakdown of energy consumption for
a 16 KB direct-mapped cache with various numbers of subbanks. We have calculated the
energy consumption based on Kamble's model [48]; the energy for the I/O drivers and the address
decoder in the model is not included. The horizontal axis in the figure shows the total
number of subbanks. Ebit_data and Ebit_tag are the energies consumed on the bit-lines in the data
memory and the tag memory during a cache access, respectively. All results are
normalized to the base configuration denoted as "8(1)", in which there is no subbanking. It
is clear from the figure that increasing the number of subbanks greatly reduces the energy consumed in
the data memory.
Figure 4.1: The Energy Effect of Tag Comparison. (Breakdown of normalized energy consumption into Ebit_tag, Ebit_data, and others, versus the number of subbanks, for 32-bit and 64-bit words.)
However, since the tag-access energy remains unchanged, the effect of the tag
comparison on the total energy consumption becomes significant. When the word size is
32 bits, Ebit_tag accounts for about 30% of the energy. If the word size is 64 bits, Ebit_tag occupies almost
half of the total energy.
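The growing weight of the tag access can be seen from the bit counts alone; the sketch below is a crude proxy for Kamble's model (it simply counts activated bit-lines and selects the subbank from low-order word-address bits), so the exact percentages differ from Figure 4.1.

def subbank_for(address, word_bytes=4, num_subbanks=8):
    """The data subbank holding a word, selected directly from the address bits."""
    return (address // word_bytes) % num_subbanks

def tag_bitline_fraction(line_bytes, tag_bits, num_subbanks):
    """Fraction of activated bit-lines that belong to the tag memory."""
    data_bits = line_bytes * 8 // num_subbanks   # only one subbank of the line is read
    return tag_bits / (tag_bits + data_bits)

for subbanks in (1, 2, 4, 8):
    print(subbanks, round(tag_bitline_fraction(32, 18, subbanks), 2))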
4.3 Interline Tag-Comparison Cache
As explained in Section 4.2, it is important to reduce the tag-access energy in order to obtain
further energy reductions in low-power caches. A technique to reduce the frequency of tag
comparisons has been proposed [70]; it is referred to as the interline tag-comparison cache
in this chapter.
When two instructions i and j are executed successively, we can consider the following
cases:
• Intraline sequential flow: i and j reside in the same cache line, and their addresses are
sequential.
• Intraline non-sequential flow: i and j reside in the same cache line, and their addresses
are not sequential. Therefore, i is a taken-branch or jump instruction.
• Interline sequential flow: i and j reside in different cache lines, and their addresses are
sequential.
• Interline non-sequential flow: i and j reside in different cache lines, and their ad-
dresses are not sequential. Therefore, i is a taken-branch or jump instruction.
All instructions in a cache line are filled into the cache together when one of the instructions
causes a cache miss. Therefore, the residence of instruction j is guaranteed when i and j
are in an intraline sequential flow or an intraline non-sequential flow. In this case, the tag
comparison for instruction j can be omitted. Namely, the cache needs to perform the tag
comparison only when two successive instructions are in an interline sequential flow or an
interline non-sequential flow. The interline flows can be detected by comparing the current
PC with the previous one, as sketched below. Another approach to finding the interline flows is to analyze the
compiled program code; in this case, unused space in the operation code can be used as a
compiler hint.
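The comparison reduces to a check on line addresses; the line size and the fetch trace in this sketch are illustrative assumptions.

LINE_BYTES = 32

def needs_tag_comparison(prev_pc, cur_pc):
    """Interline-flow test: the tag is compared only when the fetch leaves
    the cache line of the previous fetch (sequentially or via a branch)."""
    return (prev_pc // LINE_BYTES) != (cur_pc // LINE_BYTES)

# With 4-byte instructions, only the fetch at 0x1020 crosses a line boundary here.
pcs = [0x1000, 0x1004, 0x1008, 0x1010, 0x1020]
print([needs_tag_comparison(a, b) for a, b in zip(pcs, pcs[1:])])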
4.4 History-Based Tag-Comparison Cache
4.4.1 Concept
The content of a cache memory is updated when cache misses take place. Instruction caches
achieve very high cache-hit rates due to the rich locality of memory references. This means
that the content of instruction caches is rarely updated.
There are many loops in programs, so some instruction blocks are executed many times.
In this chapter, we call such a run-time instruction block a "dynamic basic-block".
A dynamic basic-block consists of one or more successive basic blocks. The top of the
dynamic basic-block is addressed by a branch-target address, and the end of it is addressed
by a taken-branch or jump address. Therefore, not-taken conditional branches might be
included in the dynamic basic-block.
Consider the case where a dynamic basic-block is executed many times during program ex-
ecution. On the first execution of the dynamic basic-block, the tag comparison has to be performed
for all instructions. However, on the second execution, if no cache miss has
occurred since the first execution, the dynamic basic-block is guaranteed to reside in the
cache. Hence, we can determine that the indexed cache entry corresponds to the requested
address without performing the tag comparison.
When a dynamic basic-block is executed, the history-based tag-comparison cache attempts
to avoid unnecessary tag comparisons by detecting the following conditions:
1. the dynamic basic-block has been executed, and
2. no cache miss has occurred since the previous execution of the dynamic basic-block.
The history-based tag-comparison cache omits the tag comparison when the above conditions
are satisfied, regardless of whether the instruction flow is intraline or interline.
4.4.2 Organization
To detect the conditions for omitting the tag comparison, as explained in Section 4.4.1, ex-
ecution footprints are recorded. The footprint indicates whether the corresponding dynamic
basic-block resides in the cache. If a dynamic basic-block left its footprint at the previ-
ous execution, then the tag comparisons for the current execution are omitted. All footprints
are erased when a cache miss takes place, because a dynamic basic-block (or a part of one)
might have been evicted from the cache.
Recent high-performance processors employ a branch-prediction unit to solve control-
hazard problems. A conventional branch-prediction unit consists of a BPT (branch predictor
table) and a BTB (branch target buffer). Each entry of the BTB has a branch-address field
and a target-address field. If a matching entry to the current program counter (PC) is found
on BTB lookup, the corresponding target address is stored to the PC for the next instruction
fetch. When an unregistered taken branch is performed, a new entry will be added.
The execution footprints for the history-based tag-comparison cache are implemented in
the BTB with additional information. Figure 4.2 depicts an organization of the extended
Figure 4.2: Organization. (The extended BTB with branch-address and target-address fields plus the RCT and RCN flags; the TCO flag drives the tag-comparison enable of the direct-mapped instruction cache.)
BTB. The following flags are added to each BTB entry:
• RCT (Residing in Cache on Taken) 1-bit flag per entry : This is an execution footprint
for a dynamic basic-block, the top of which is addressed by the corresponding target
address. This flag is set to 1 when the corresponding branch is taken, and is reset to 0
whenever a cache miss occurs.
• RCN (Residing in Cache on Not-taken) 1-bit flag per entry : This is an execution
footprint for the fall-through instructions. This flag is set to 1 when the corresponding
branch is not taken, and is reset to 0 whenever a cache miss occurs.
In addition, a flag to enable the tag comparison in the cache is required:
• TCO (Tag-Comparison Omit) 1-bit flag : This flag indicates whether the tag compar-
ison can be omitted. If this flag is 1, the tag comparison is not performed.
The TCO does not appear on the cache critical-paths. Hence, the history-based tag-comparison
cache does not have any cache-access-time overhead.
4.4.3 Operation
The execution footprints (i.e., the RCT and RCN flags) are left according to run-time program-
execution behavior, and are erased whenever a cache miss occurs. In addition, all footprints
have to be erased when BTB replacements take place, because the scope of the RCT and
RCN flags is defined by the target address in the corresponding BTB entry and the branch
address in another BTB entry; this scope information might be evicted from the BTB by
the replacements. In that case, the cache loses the ability to detect how long the tag
comparison can be omitted. Figure 4.3 (A) presents the operation flow on BTB lookups, and
the extended BTB behaves as follows:
1. When a cache miss occurs, all footprints in the BTB are erased (all RCT and RCN
flags are reset to 0). Note that no cache miss can occur while the TCO flag is 1.
2. If a matching entry is found in the BTB, the next step is performed. Otherwise,
the operation returns to the initial state, and the cache starts to fetch the next
instruction.
3. The return address stack (RAS) improves the accuracy of branch prediction [46]. How-
ever, we have not extended the RAS to record execution footprints. Therefore,
the TCO has to be reset to 0 whenever the target address is provided by the RAS.
4. There is a matching entry in the BTB, so the TCO flag and the footprint are
modified. If the branch-prediction result is taken, the RCT flag in the matching entry
is stored to the TCO flag, and then the RCT flag is set to 1 as the execution footprint.
Otherwise, the RCN flag is treated in the same manner as the RCT flag.
When a wrong branch-prediction is detected, the PC has to be recovered to guarantee
the correct execution. This recovery might cause BTB updates. Figure 4.3 (B) shows the
extended BTB operation on the wrong-prediction recovery, and the extended BTB works as
follows:
5. There are two cases for the BTB update: one is the registration of a new entry, and
the other is the modification of the target-address field in an existing entry. If the registration
evicts another BTB entry, all footprints are erased. Then the TCO flag is reset to 0
Figure 4.3: Operation State Diagram. ((A) operation on BTB lookups; (B) operation on wrong-branch recovery.)
because the dynamic basic-block addressed by the new target address may not reside
in the cache. In addition, the RCT flag is set to 1 as the execution footprint.
6. Only the branch direction is wrong. Thus, the TCO is stored back into the RCT or RCN
flag, the correct footprint is then stored to the TCO, and the execution footprint is
recorded.
Figure 4.4 (A) shows an example of a program execution flow that iterates seven times.
The size of the dynamic basic-block is varied by the three conditional branches at addresses
B, C, and D. The solid and broken lines in the figure represent the control flow of the loop
execution. Figure 4.4 (B) shows the state of the extended BTB. A pair of a number
and a capital letter in the figure denotes when the BTB is updated; for instance, 1-C
means that the BTB is updated on the branch-C execution in iteration 1. In the first
iteration, the tag comparisons are performed. The history-based tag-comparison cache then
works as follows:
1-C : A new entry for branch-C is registered in the BTB. Since the TCO is still 0,
the tag comparison continues to be performed in iteration 2. The RCT flag is set to 1 as the
execution footprint.
Figure 4.4: Example of Operation. ((A) the execution flow of the loop with branches B, C, and D over seven iterations; (B) the BTB conditions, RCT, RCN, and TCO, at each update point.)
2-C : The footprint (RCT flag) recorded at 1-C is stored to the TCO. Thus, no tag
comparison is performed in iteration 3.
3-C : The footprint (RCT flag) recorded at 2-C is stored to the TCO. Therefore, no tag
comparison is performed in iteration 4.
4-C : The condition of branch-C is not taken. In this case, the size of the dynamic basic-
block is increased. The execution footprint for the fall-through instructions (i.e., the RCN
flag) is stored to the TCO, and then the RCN flag is set to 1. The tag comparison is resumed.
4-D : As a new entry for branch-D is registered in the BTB, the TCO is reset to 0.
Therefore, the tag comparison continues to be performed in iteration 5.
5-C : The RCN flag corresponding to branch-C recorded at 4-C is stored to the TCO.
As the TCO is set to 1, the tag comparisons for the remaining instructions in iteration
5 are omitted.
Table 4.1: Benchmark Programs.
Programs (input):
SPECint95 (training): 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex
SPECfp95 (test): 102.swim, 107.mgrid, 110.applu, 125.turb3d, 141.apsi
5-D : The footprint (RCT flag) recorded at 4-D is stored to the TCO, so the tag
comparison is again omitted.
6-C : The RCN flag recorded at 5-C is stored to the TCO, and the tag comparisons are
omitted.
6-D : The RCT flag recorded at 5-D is stored to the TCO, and the tag comparisons are
omitted.
7-B : The condition of branch-B is taken. In this case, the size of the dynamic basic-block
is decreased. Since this is the first taken condition for branch-B, a new entry is registered
in the BTB. Therefore, the TCO is reset to 0, and the tag comparison is resumed for the
target dynamic basic-block.
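The rules illustrated by this example can be condensed into a short behavioral sketch; the structure below ignores BTB associativity, replacements, and the RAS, so it is a simplified restatement of the mechanism rather than the simulator implementation.

class ExtendedBTBEntry:
    def __init__(self, target):
        self.target = target
        self.rct = 0      # footprint: taken-path dynamic basic-block resides in the cache
        self.rcn = 0      # footprint: fall-through dynamic basic-block resides in the cache

class HistoryBasedTagComparison:
    def __init__(self):
        self.btb = {}     # branch address -> ExtendedBTBEntry (conflicts/evictions ignored)
        self.tco = 0      # 1 => the tag comparison may be omitted for the current block

    def on_cache_miss(self):
        for entry in self.btb.values():    # any miss may have evicted part of a block
            entry.rct = entry.rcn = 0
        self.tco = 0

    def on_btb_hit(self, branch_pc, taken):
        entry = self.btb[branch_pc]
        if taken:                          # use the taken footprint, then leave a new one
            self.tco, entry.rct = entry.rct, 1
        else:                              # same treatment for the fall-through footprint
            self.tco, entry.rcn = entry.rcn, 1

    def on_new_taken_branch(self, branch_pc, target):
        self.btb[branch_pc] = ExtendedBTBEntry(target)
        self.tco = 0                       # the target block may not reside in the cache
        self.btb[branch_pc].rct = 1        # leave the execution footprint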
4.5 Evaluations
In this section, we evaluate the energy efficiency of the history-based tag-comparison cache
by comparing it with a conventional cache and the interline tag-comparison cache. The combi-
nation of the history-based tag-comparison cache and the interline tag-comparison cache is
also evaluated.
4.5.1 Simulation Environment
In this evaluation, eight integer programs using the training input and five floating-point programs
using the test input from the SPEC95 benchmark suite are used [82], as shown in Table 4.1. The
benchmark programs are executed on the SimpleScalar simulator [11], which we have modified to
implement the history-based tag-comparison cache. For each program, the total
count of tag comparisons on the following caches is measured:
• C-TC (Conventional Tag-Comparison cache) : The tag comparison is performed on
every instruction fetch. This is the base model in this evaluation.
• IL-TC (InterLine Tag-Comparison cache) : The tag comparison is performed only on
interline flows, as explained in Section 4.3.
• H-TC (History-based Tag-Comparison cache) : The tag comparison is performed ac-
cording to the TCO flag, as explained in section 4.4.
• H-TC ideal : This is the same as the H-TC cache except that it has a perfect in-
struction cache (i.e., no cache misses) and a fully associative BTB (i.e., no BTB-conflict
misses).
• HIL-TC (History-based InterLine Tag-Comparison cache) : This is a combination of
the IL-TC cache and the H-TC cache. The tag comparison is performed if the TCO
flag is 0 and the fetched instruction is on the interline flows.
Unless stated otherwise, the following configuration is assumed: the cache size is 32 KB, the
cache-line size is 32 bytes, the direct-mapped BPT has 2048 entries, the BTB has 512 sets with
an associativity of 4, and the RAS size is 8.
4.5.2 Energy Reduction for Tag Comparisons
Table 4.2 shows the total count of tag comparisons normalized to the conventional cache
(C-TC). First, we compare the history-based tag-comparison cache (H-TC) with the conven-
tional cache (C-TC) and the interline tag-comparison cache (IL-TC). Since there are many
incremental accesses in almost all programs, the interline tag-comparison cache works well for
all programs, while the effectiveness of the history-based tag-comparison cache is application
Table 4.2: Normalized Tag-Comparison Counts.
Benchmark   C-TC   IL-TC   H-TC   H-TC ideal   HIL-TC
099.go 1.000 0.3203 0.7604 0.4027 0.2378
124.m88ksim 1.000 0.3302 0.4217 0.1856 0.1361
129.compress 1.000 0.3528 0.1751 0.1718 0.0706
126.gcc 1.000 0.3343 0.6810 0.2812 0.2278
130.li 1.000 0.3515 0.4500 0.1811 0.1684
132.ijpeg 1.000 0.2992 0.1062 0.0560 0.0311
134.perl 1.000 0.3436 0.6643 0.1361 0.2249
147.vortex 1.000 0.3213 0.8838 0.2141 0.2837
102.swim 1.000 0.2957 0.0623 0.0622 0.0278
107.mgrid 1.000 0.2600 0.0008 0.0002 0.0002
110.applu 1.000 0.2657 0.0252 0.0248 0.0070
125.turb3d 1.000 0.2813 0.0849 0.0727 0.0266
141.apsi 1.000 0.2801 0.1050 0.0476 0.0307
dependent. The history-based tag-comparison cache produces a larger reduction than the
interline tag-comparison cache for two integer programs, 129.compress and 132.ijpeg, and for
all floating-point programs. In particular, the cache eliminates more than 90% of the tag comparisons
for the floating-point programs. This result can be understood by considering the charac-
teristics of the programs: the floating-point programs and media application programs
have relatively well-structured loops, and the history-based tag-comparison cache avoids
unnecessary tag comparisons by exploiting the iterative execution of these programs.
Figure 4.5 shows the total count of tag comparisons and the total energy dissipated for them during
the execution of each program. In the figure, we assume that the average energy dissipated
for a tag comparison (ETag) is 1.1 nanojoules. ETag is calculated based on Kamble's
model [48], assuming the 0.8-micron CMOS cache design described in [95]. To
obtain the value of ETag using Kamble's model, the following parameters are assumed
(a more detailed explanation of the calculation is presented in Section 5.6.3):
Figure 4.5: Total Energy Dissipated for Tag Comparisons for The Execution of Programs. (Total count of tag comparisons and total energy, assuming ETag = 1.1 nJ, for the conventional cache and the history-based tag-comparison cache on each benchmark.)
• The total number of rows, Nrow, is 1024 (32 KB / 32 B = 1024 sets).
• The cache-line bits, L, are 0, because ETag does not include the energy dissipated in the data
subarray.
• The tag bits, T, are 17, because the index is 10 bits and the offset is 5 bits (32 − 10 − 5 = 17).
• The associativity, M, is 1.
It is observed from Figure 4.5 that the history-based tag-comparison cache achieves signif-
icant energy reductions, in particular for the floating-point programs.
Next, we compare the ideal history-based tag-comparison cache (H-TC ideal) with the
realistic one (H-TC). It can be seen from the results of the ideal cache that the history-based
tag-comparison cache has the potential to achieve better results than the interline tag-
comparison cache for all programs, with the exception of 099.go. However, the realistic cache
with hardware constraints (H-TC) does not make significant improvements for some integer
programs: 099.go, 126.gcc, 134.perl, and 147.vortex. For these programs, the tag-comparison
reduction achieved by the interline tag-comparison cache is about 70%, while that produced
by the realistic history-based tag-comparison cache is only from 12% to 34%. The difference
between the ideal cache (H-TC ideal) and the realistic cache (H-TC) is analyzed in Section
4.5.3.
Finally, we discuss the efficiency of the history-based interline tag-comparison cache. We
can see from the simulation results that the combination of the history-based tag-comparison
and the interline tag-comparison makes a significant reduction. In the best case, 107.mgrid,
the total count of tag comparisons is reduced to less than 0.01%. Even in the worst case,
147.vortex, the combination eliminates more than 70% of the tag comparisons.
4.5.3 Effects of Hardware Constraints
When cache misses or BTB replacements take place, all footprints in the BTB are erased. In
this section, we analyze the effects of the cache size and the BTB associativity. Four integer
programs, 132.ijpeg, 099.go, 126.gcc, and 147.vortex, are used in this analysis. The history-
based tag-comparison cache works well for 132.ijpeg, but not for the other programs, as
reported in Section 4.5.2.
4.5.3.1 BTB Associativity
Figure 4.6 shows the total count of tag comparisons for the history-based tag-comparison
cache (H-TC) with various BTB associativities. Note that the total BTB size is kept constant. In
addition, the cache size is 32 KB, except that the ideal cache (H-TC ideal) has a perfect
instruction cache. All results are normalized to the conventional cache (C-TC).
It is clear from the figure that there is no significant improvement even if the BTB associativity is increased.
Figure 4.6: Effect of the BTB Associativity. (Normalized number of tag comparisons for H-TC with 2-way through fully associative BTBs, and for H-TC ideal, on 132.ijpeg, 099.go, 126.gcc, and 147.vortex.)
The gap between the realistic cache (H-TC) and the ideal cache (H-TC
ideal) is still large. This trend can be seen for all benchmark programs. We consider that the
BTB has enough capacity for these programs, so BTB conflicts rarely take place
even if the BTB has low associativity.
4.5.3.2 Cache Size
The cache-hit rates directly affect the efficiency of the history-based tag-comparison cache.
Figure 4.7 shows the simulation results of the history-based tag-comparison cache (H-TC)
with various cache sizes. All results are normalized to the conventional cache (C-TC). Note
that the basic BTB configuration is maintained, except that the ideal cache (H-TC ideal) has
a fully associative BTB.
For all programs, increasing the cache size improves the efficiency of the history-based tag-comparison cache.
Figure 4.7: Effects of Cache Capacity. (Normalized number of tag comparisons for H-TC with 4 KB to 128 KB and perfect instruction caches, and for H-TC ideal, on 132.ijpeg, 099.go, 126.gcc, and 147.vortex.)
In particular, when the cache size exceeds 64 KB, the realistic history-
based tag-comparison cache reduces the total count of tag comparisons as effectively as the ideal
cache for 132.ijpeg. For the other programs, the gap between the realistic cache and the
ideal cache decreases as the cache size increases. We have measured the breakdown
of the footprint-erase operations for 099.go: 98% of the footprint-erase operations
are caused by cache misses, and the remaining 2% are caused by BTB replacements.
Therefore, the efficiency of the history-based tag-comparison is affected much more strongly by the cache
size than by the BTB replacements. Since the trend has been to increase on-chip cache sizes,
the efficiency of the history-based tag-comparison cache will also increase.
4.5.4 Energy Overhead
In this section, we discuss the energy overhead of the extended BTB for the history-based
tag-comparison cache. As explained in section 4.4.2, only 2 bits per entry are added: one
for the RCT flag and one for the RCN flag. We need to consider the energy consumed for
reading or writing the footprints and that for erasing all footprints.
The energy consumed for reading or writing the footprints depends on the implementation
of the BTB. If pipelined access is employed, the BTB lookup is performed first, and then the
target address and the footprints in the matching entry are accessed. If there is no matching
entry, the BTB access finishes without accessing the target address and the footprints. With
this implementation, the footprints are accessed only on BTB hits, and a BTB hit occurs only when
the instruction being fetched is a branch or jump. Therefore, the
energy for accessing the footprints is consumed only on the execution of branches or jumps that
are already registered in the BTB. On the other hand, the tag comparison in conventional
caches is performed on every clock cycle. Therefore, the energy overhead is trivial.
In a high-performance implementation of a set-associative BTB, the branch addresses
and the target addresses in the indexed set are read in parallel, so the RCT and RCN
flags are also read in parallel. In this case, the energy overhead appears on every cycle.
The total number of footprints to be accessed depends on the BTB associativity: when
the BTB has an n-way set-associative organization, (RCT + RCN) × n = 2n bits are accessed.
If the BTB has a much higher associativity (i.e., a large n), this energy overhead
becomes a serious problem.
Next, we consider the energy consumed for erasing all footprints (i.e., for resetting all RCT
and RCN flags to 0). This energy overhead depends on how many footprints are reset from
1 to 0. Table 4.3 shows the energy overhead of the extended BTB in terms of the number
of erased footprints. The column labeled "Total" is the total number of erased
footprints during the program execution. The columns labeled "per erase" and "per i-fetch"
are the average number of erased footprints per footprint-erase operation and per
instruction fetch, respectively. For almost all programs, fewer than 6 footprints are erased
per footprint-erase operation on average. Conventional caches need to read the whole tag data
on every clock cycle, while the history-based tag-comparison cache erases less than 0.1 footprint
bit per cycle. Therefore, we believe that this energy
Table 4.3: Total Number of Erased Footprints.
Benchmark Total per erase per i-fetch
099.go 44,765,053 1.576 0.082
124.m88ksim 5,004,912 5.780 0.042
129.compress 21,114 5.504 0.001
126.gcc 113,901,538 1.696 0.088
130.li 8,602,721 9.374 0.047
132.ijpeg 4,045,877 2.322 0.003
134.perl 201,247,564 3.234 0.084
147.vortex 181,134,297 2.270 0.072
102.swim 7,482 1.167 0.000
107.mgrid 256,370 0.996 0.000
110.applu 109,144 0.565 0.000
125.turb3d 22,937,475 5.612 0.001
141.apsi 23,492,530 1.525 0.003
overhead can be ignored.
4.6 Related Work
Panwar et al. have proposed the concept of conditional tag comparison to reduce the fre-
quency of the tag comparisons, and have presented the interline tag-comparison [70], as
explained in Section 4.3. The interline tag-comparison cache omits the tag comparison if
instructions have intraline flows, whereas the history-based tag-comparison cache can omit
it not only on the intraline flows but also on the interline flows.
The S-cache has also been proposed in [70]. The S-cache is a small memory added to the
L1 cache, and it has a statically allocated address space. No cache replacement occurs in the S-
cache; therefore, the tag comparison is unnecessary because S-cache accesses always hit.
The scratchpad-memory [69], the loop-cache [8], [7], and the decompressor-memory [38] also
employ this kind of small memory, and have the same effect as the S-cache. For the scratchpad-
memory and the loop-cache, the compiler analyzes the programs and allocates frequently executed
instructions to the small memory. For the S-cache and the decompressor-memory, prior
simulations using an input-data set are required to optimize the code allocation. These works
differ from ours in two aspects. First, these caches require a static analysis. Second, the cache
module has to be separated into a dynamically allocated memory space (i.e., the main cache) and a
statically allocated memory space (i.e., the small cache). The history-based tag-comparison
cache does not require these arrangements.
4.7 Conclusions
In this chapter, we have proposed the history-based tag-comparison cache for low energy
consumption. The history-based tag-comparison cache exploits the following two facts: first,
instruction-cache hit rates are very high; second, almost all programs have many loops.
The cache records execution footprints and determines whether the instructions to be
fetched currently reside in the cache without a tag lookup. Therefore, the cache can reduce the
energy consumed for the tag comparisons. The branch target buffer (BTB) is extended to
record the execution footprints.
We have evaluated the efficiency of the history-based tag-comparison cache. It is observed
that more than 90% of the tag comparisons are eliminated in many benchmark programs. In
addition, the combination of our history-based tag-comparison cache and the interline tag-
comparison cache makes a remarkable reduction: in half of the benchmark programs, the total
count of tag comparisons is reduced by more than 95%. Moreover, we have analyzed the effects
of hardware constraints, namely the cache size and the BTB associativity. As a result, it is observed that
the efficiency of the history-based tag-comparison cache is improved by increasing the cache
size. On-chip cache sizes have certainly been increasing. Therefore, we believe
that the history-based tag-comparison is a superior approach to achieving low-power
instruction caches for future processor chips.
Chapter 5
Variable Line-Size Cache
Architecture
5.1 Introduction
Recent remarkable advances of VLSI technology have been increasing processor speed and
DRAM capacity dramatically. However, the advances have also introduced a large and
growing performance gap between the processor and DRAM. This problem is referred to
as the "Memory Wall" [12], [97], and it results in poor total system performance in spite of higher
processor performance. Integrating processors and DRAM on the same chip, as a merged
DRAM/logic LSI, is a good approach to solving the "Memory Wall" problem [72]. Merged
DRAM/logic LSIs provide high on-chip memory bandwidth by interconnecting the processors
and DRAM with wide on-chip busses. In addition, the design space of the memory hierarchy
for merged DRAM/logic LSIs becomes so broad that designers can choose from various
on-chip memory-path architectures. We can classify the on-chip memory-path
architectures as shown in Figure 5.1:
• DRCM (Datapath–Register–Cache–MainMemory) architecture: This architecture comes
straight from the common memory hierarchy, which is widely employed in recent com-
mercial processor chips. The on-chip memory-path consists of datapath, registers, cache
memory, and main memory. The high on-chip memory bandwidth is exploited between
the cache and the main memory on cache replacements [64], [81], [13], [78].
Figure 5.1: On-Chip Memory-Path Architectures. (Discrete LSIs versus merged DRAM/logic LSIs: combinations of datapath, registers, SRAM cache, and DRAM main memory on one chip.)
• DRM (Datapath–Register–MainMemory) architecture: This architecture is based on
vector processing. The on-chip memory-path consists of datapath, vector registers, and
main memory. The high on-chip memory bandwidth is exploited on vector load/store
operations [73].
• DM (Datapath–MainMemory) architecture: This architecture is based on the direct
calculations, in which all operands reside in the main memory. The on-chip memory-
path consists of datapath, and main memory. The high on-chip memory bandwidth is
exploited on ALU operations [27].
Which memory-path architecture should be employed depends largely on the characteristics
of target programs. Among these candidates for on-chip memory path architectures, we focus
on the memory hierarchy including cache memory, the DRCM architecture. On-chip SRAM
caches are still necessary for most programs to hide large DRAM-access latency even if the
processors and DRAM are integrated on the same chip.
This chapter introduces the concept of a novel cache architecture for merged DRAM/logic
LSIs, called the "Variable Line-Size Cache" (VLS cache). The VLS cache attempts to make good
use of the attainable high on-chip memory bandwidth by optimizing the cache-line size.
A large cache-line size can benefit programs with rich spatial locality of references
due to the effect of prefetching, while a small cache-line size makes it possible to reduce the
frequency of cache-line conflicts without any access-time overhead. In addition, decreasing
the cache-line size reduces the energy consumed for on-chip main-memory accesses. This
chapter also introduces two VLS caches: a statically variable line-size cache (S-VLS cache)
and a dynamically variable line-size cache (D-VLS cache). In addition, we evaluate the
performance/energy efficiency of the VLS caches using many benchmark programs.
The rest of this chapter is organized as follows. Section 2 shows a conventional approach
to exploiting the high on-chip memory bandwidth and clarifies its advantages and disadvantages.
Section 3 gives the concept of the VLS cache architecture as an approach
to overcoming the disadvantages. Section 4 and Section 5 propose two types of VLS cache:
the statically variable line-size cache and the dynamically variable line-size cache. Section 6 presents some simulation
results, evaluates the performance/energy efficiency of the VLS caches, and analyzes the
dynamically variable line-size cache in detail. Section 7 shows related work, and Section 8
concludes this chapter.
5.2 Conventional Approaches to Exploiting High Memory-
Bandwidth
In merged DRAM/logic LSIs with a memory hierarchy including cache memory, the high
on-chip memory bandwidth can be exploited on cache replacements.
We restate here the definition of the average memory-access time, Equations (2.1) and (2.2),
explained in Chapter 2:
AMAT = T_{Cache} + CMR \times 2 \times T_{MainMemory} \qquad (2.1)

T_{MainMemory} = T_{DRAMarray} + \frac{LineSize}{BandWidth} \qquad (2.2)
Even if LineSize increases within the range of BandWidth, the miss penalty is not increased,
assuming a constant DRAM access time. Since BandWidth in traditional computer
systems is very small due to the I/O-pin bottleneck, the miss penalty increases if we
increase LineSize. On the other hand, the BandWidth of merged DRAM/logic LSIs can be
enlarged dramatically because there is no I/O-pin limitation; the high bandwidth is easily
realized by widening the on-chip busses. Therefore, designers can increase the cache-line size
within the range of the enlarged BandWidth while keeping TDRAMarray constant. Generally, large cache
lines can benefit programs with rich spatial locality due to the effect of prefetching.
Figure 5.2: The Effects of Cache-Line Size to Cache-Miss Rates. (Miss ratio (%) versus cache-line size, 16 to 256 bytes, in panels (a), (b), and (c), for 052.alvinn, 072.sc, 104.hydro2d, 099.go, 134.perl, 126.gcc, 132.ijpeg, 103.su2cor, and 101.tomcatv.)
Consequently, in merged DRAM/logic LSIs, the designer can take full advantage
of the spatial locality inherent in programs. For example, since instruction references have rich
spatial locality in almost all programs, increasing the cache-line size makes a significant
performance improvement [78].
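Equation (2.2) makes the benefit of the wide on-chip bus explicit; in the sketch below, the DRAM-array latency and the two bus widths are illustrative figures, not measured parameters.

def miss_penalty(line_bytes, bus_bytes_per_cycle, t_dram_array=30):
    """T_MainMemory = T_DRAMarray + LineSize / BandWidth, in cycles (equation (2.2))."""
    return t_dram_array + line_bytes / bus_bytes_per_cycle

for line in (16, 32, 64, 128, 256):
    narrow = miss_penalty(line, bus_bytes_per_cycle=8)     # pin-limited off-chip bus
    wide = miss_penalty(line, bus_bytes_per_cycle=256)     # wide on-chip bus of a merged DRAM/logic LSI
    print(line, narrow, wide)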
Unfortunately, since conventional caches employ a single cache-line size, increasing the
cache-line size is the only approach to exploiting the high on-chip memory bandwidth. How-
ever, increasing the cache-line size results in reducing the total number of cache lines which
can be held in the cache memory. Thus, large cache lines might worsen the cache-miss rates
due to frequent cache-line evictions if programs have poor spatial locality. Actually, the
spatial locality of data references depends on the characteristics of programs, and general
purpose processors have to execute a number of programs. Figure 5.2 shows how the cache-
miss rate is affected by the cache-line size in a 16 KB direct-mapped data cache. 052.alvinn
and 072.sc, from the SPEC92 benchmark program suite, are executed using the reference in-
put. The other integer programs and floating-point programs, from the SPEC95 benchmark
program suite, are executed using the training input and the test input, respectively. These
programs are compiled by GNU CC with the “–O2” option, and are executed on an Ultra
SPARC processor. It is clear from Figure 5.2 that the best cache-line size varies widely from
program to program. For example, the best cache-line size in Figure 5.2 (a) is equal to or larger
than 128 bytes, that in Figure 5.2 (b) is equal to or smaller than 32 bytes, and that in Figure
5.2 (c) is just 64 bytes. If programs do not have enough spatial locality, as shown in Figure
5.2 (b) or (c), we will have the following problems:
1. A number of conflict misses will take place due to frequent evictions.
2. As a result, a lot of time and energy will be wasted by a number of main-memory
accesses.
3. Activating the wide on-chip bus and the DRAM array will also dissipate a lot of energy.
Employing a set-associative cache is a conventional approach to solving the first and second
problems, because it can improve the cache-hit rates. However, since increasing the cache
associativity makes the access time longer, it might worsen the memory performance [30][96]. In
addition, we still have the third problem due to a fixed large cache-line size.
5.3 Variable Line-Size Cache
5.3.1 Terminology
In the VLS cache, an SRAM (cache) cell array and a DRAM (main memory) cell array are
divided into several subarrays. Data transfer for cache replacements is performed between
corresponding SRAM and DRAM subarrays. Figure 5.3 summarizes the definition of terms.
Subline, or address-block, is a block of data associated with a single tag in the cache. Line,
or transfer-block, is a block of data transferred at once between the cache and the main
memory. The sublines from every SRAM subarray, which have the same cache-index, form
a cache-sector. A cache-sector and a subline which are being accessed during a cache lookup
are called a reference-sector and a reference-subline, respectively. When a memory reference
is a cache hit, the desired data resides in the reference-subline. Otherwise, the desired data
is not in the reference-subline but only in the main memory. A memory-sector is a block of
data in the main-memory, and corresponds to the cache-sector. Adjacent-subline is defined
as follows:
1. It resides in the reference-sector, but is not the reference-subline. For the example
depicted in Figure 5.3, the sublines which include addresses 32, 2, and 3 satisfy this
condition.
[Figure 5.3: Terminology for VLS Caches — a VLS cache with four SRAM (cache) subarrays and four corresponding DRAM (main memory) subarrays, illustrating the address-block (subline), transfer-block (line), cache-sector, memory-sector, reference-sector, reference-subline, and adjacent-subline for references to addresses 1 and 3.]
2. Its main-memory home location is in the same memory-sector as that of the data
which is currently being referenced by the processor. For the example depicted in
Figure 5.3, the sublines which include addresses 2 and 3 satisfy this condition.
3. It has been referenced at least once since it was fetched into the cache. For the example
depicted in Figure 5.3, the subline which includes address 3 satisfies this condition.
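As a concrete restatement of the three conditions, the following C sketch tests whether a given subline qualifies as an adjacent-subline. The structure, its fields, and the function name are our own illustrative choices, not hardware signals defined by this chapter.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative per-subline state; field names are ours. */
struct subline {
    uint32_t tag;        /* tag currently stored for this subline            */
    bool     valid;
    bool     referenced; /* set once the processor has accessed the subline  */
};

/* A subline of the reference-sector is an adjacent-subline when
 * (1) it is not the reference-subline itself,
 * (2) its home memory-sector matches that of the current reference
 *     (i.e., its stored tag equals the tag of the reference address), and
 * (3) it has been referenced since it was fetched into the cache.          */
static bool is_adjacent_subline(const struct subline *s,
                                int subarray, int ref_subarray,
                                uint32_t ref_tag)
{
    if (subarray == ref_subarray)            /* condition (1) */
        return false;
    if (!s->valid || s->tag != ref_tag)      /* condition (2) */
        return false;
    return s->referenced;                    /* condition (3) */
}

int main(void)
{
    struct subline s = { 0x1234, true, true };
    /* Subline in subarray 2; the reference goes to subarray 0 with the same tag. */
    printf("adjacent-subline: %d\n", is_adjacent_subline(&s, 2, 0, 0x1234)); /* prints 1 */
    return 0;
}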
5.3.2 Concept and Principle of Operations
To make good use of the high on-chip memory bandwidth, the VLS cache optimizes its line
size according to the characteristics of programs. When programs have rich spatial locality,
the VLS cache uses larger lines, each of which consists of many sublines. Conversely, when
programs have poor spatial locality, the VLS cache uses smaller lines, each of which consists of a
single subline or a few sublines, and thereby tries to avoid cache conflicts. In addition, activating
only the DRAM subarrays corresponding to the small lines (i.e., the small number of sublines)
yields a significant energy reduction.
[Figure 5.4: Three Different Transfer-Block Sizes on Cache Replacements — (a) replacement with a minimum 32-byte line, (b) replacement with a medium 64-byte line, and (c) replacement with a maximum 128-byte line; shaded sublines indicate where data transfer occurs between the cache and the main memory.]
The construction of the direct-mapped VLS cache illustrated in Figure 5.4 is similar to that
of a conventional 4-way set-associative cache. However, the conventional 4-way set-associative
cache has four locations where a subline can be placed, while the direct-mapped VLS cache
has only one location for a subline, just like a conventional direct-mapped cache. Since the
VLS cache attempts to reduce conflict misses without increasing the cache associativity, the
fast access of a direct-mapped cache can be maintained.
For the VLS cache shown in Figure 5.3, there are three possible line sizes as follows:
• Minimum line size, where only the designated subline is involved in cache replacements
(see Figure 5.4 (a)).
• Medium line size, where the designated subline and one of its neighbors in the
corresponding cache-sector are involved (see Figure 5.4 (b)).
• Maximum line size, where the designated subline and all of its neighbors in the
corresponding cache-sector are involved (see Figure 5.4 (c)).
Because the VLS cache keeps the direct-mapped organization, its access time is shorter than that
of conventional caches with higher associativity. Since the medium line is not allowed to misalign
with a 64-byte boundary within the 128-byte cache-sector, the number of possible combinations of
sublines involved in a cache replacement is just seven (four for the minimum, two for the medium,
and one for the maximum line size) rather than fifteen (= 2^4 − 1).
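The following C fragment, a sketch of our own rather than part of the thesis simulator, enumerates the seven legal subline combinations by representing each candidate as a 4-bit mask over the sublines of a 128-byte cache-sector and keeping only contiguous, size-aligned groups.

#include <stdio.h>

/* Sublines 0..3 of a 128-byte cache-sector (32 bytes each).
 * A replacement pattern is legal when it is a contiguous, size-aligned
 * group of 1, 2, or 4 sublines: {0},{1},{2},{3},{0,1},{2,3},{0,1,2,3}.  */
int main(void)
{
    int count = 0;
    for (unsigned mask = 1; mask < 16; mask++) {
        for (unsigned size = 1; size <= 4; size <<= 1) {          /* 1, 2, 4 sublines */
            for (unsigned start = 0; start < 4; start += size) {  /* aligned start    */
                unsigned group = ((1u << size) - 1u) << start;
                if (mask == group) {
                    printf("legal pattern: 0x%x (%u sublines)\n", mask, size);
                    count++;
                }
            }
        }
    }
    printf("total legal patterns: %d\n", count);   /* prints 7 */
    return 0;
}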
5.3.3 Line Size Optimization
The effectiveness of the VLS cache depends heavily on how well the cache replacement is
performed with appropriate line size. We need to consider how and when the line size should
be changed. At least, there are the following three methods for determining the appropriate
line sizes.
1. Static determination based on prior simulations : Application programs are analyzed
using a cache simulator in advance. We determine suitable line sizes based on the
results of the simulation using input sets.
2. Static determination based on compiler analysis : Source programs are analyzed by a
special compiler. Then the compiler determines suitable line sizes.
3. Dynamic determination using a hardware assist : Special hardware determines suitable
line sizes at run-time.
In addition, we can consider the following granularity for line size modification.
• program by program : each program has an appropriate line size. Thus, the line size is
changed when context switches take place.
• procedure by procedure : each procedure has its own appropriate line size. Therefore, the
line size is changed at procedure calls.
• code by code : each load (or store) instruction has its own appropriate line size. The line
size is changed at load/store operations.
• data by data : each datum located in main memory has its own appropriate line size. The
line size depends on the memory-reference address.
We will introduce two VLS caches. One adopts the static line-size determination based on
prior simulations, and changes the line size program by program (section 5.4). The other
employs the dynamic determination by a hardware assist, and optimizes the line size data
by data (section 5.5).
5.4 Statically Variable Line-Size Cache
5.4.1 Organization
The most straightforward way to determine an appropriate line size is to test various fixed line
sizes by means of prior simulations. This method is applicable if a program's behavior is largely
independent of its input data. We refer to this kind of VLS cache as a Statically Variable Line-Size
Cache (S-VLS cache). Figure 5.5 illustrates the block diagram of a direct-mapped S-VLS
cache. As the subline size is 32 bytes, the S-VLS cache can provide 32-byte, 64-byte, and
128-byte line sizes.
The number and the size of the tag fields are equal to those of a conventional direct-mapped
cache with fixed 32-byte lines. A status register in the processor has a field in order to indicate
the current line size. The line-size field can be modified by a special instruction inserted at
the top of each program; that is, a program is executed with a single line size specified by the
special instruction. When a task switch occurs, the line-size information is saved and restored,
along with the machine state, by means of the conventional context saving/restoring sequence.
Therefore, changing the line size does not incur any extra performance overhead.
[Figure 5.5: A Direct-Mapped S-VLS Cache with 32-byte, 64-byte, and 128-byte Lines — the address is divided into Tag, Index, SA (subarray), and Offset fields; four 32-byte SRAM subarrays each have their own tag and comparator, and a line-size mode field in the processor's status register, set by a special instruction, selects the line size for the running program.]
5.4.2 Operation
The S-VLS cache works as follows:
1. When a memory access takes place, the cache tag array is looked up in the same manner
as normal caches, except that every SRAM subarray has its own tag memory and the
lookup is performed on every tag memory.
2. On a cache hit, the hit subline has the required data, and the cache access is performed
on this subline in the same manner as normal caches.
3. On a cache miss, a cache refill takes place as follows:
(a) According to the designated line size, one or more sublines are written back from
the indexed cache-sector into their home locations in the DRAM main memory.
(b) According to the designated line size, one or more sublines (one of which contains
the required data) are fetched from the memory-sector into the cache-sector.
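As an illustration of step 3, the following C sketch performs a cache refill for the designated line size. The helper functions (write_back_subline, fetch_subline) and the aligned-group computation are hypothetical stand-ins for simulator internals that the thesis does not spell out.

#include <stdio.h>

/* Hypothetical stand-ins for the simulator's DRAM-transfer routines. */
static void write_back_subline(int index, int sa)
{ printf("write back  set %d, subarray %d\n", index, sa); }
static void fetch_subline(int index, int sa)
{ printf("fetch       set %d, subarray %d\n", index, sa); }

/* Refill for a direct-mapped S-VLS cache; line_size_sublines is 1, 2, or 4
 * (32-, 64-, or 128-byte lines). Only the aligned group of subarrays that
 * contains the missing subline is involved in the replacement.            */
static void svls_refill(int index, int ref_subarray, int line_size_sublines)
{
    int start = ref_subarray - (ref_subarray % line_size_sublines);
    for (int sa = start; sa < start + line_size_sublines; sa++) {
        write_back_subline(index, sa);   /* step (a): write back to the DRAM subarray   */
        fetch_subline(index, sa);        /* step (b): fetch from the memory-sector      */
    }
}

int main(void)
{
    svls_refill(5, 3, 2);   /* 64-byte line: replaces subarrays 2 and 3 of set 5 */
    return 0;
}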
5.4.3 Line-Size Determination
In the case where a direct-mapped S-VLS cache provides 32-byte, 64-byte, and 128-byte lines,
for example, we can determine the suitable line size for a program in the following manner.
First, the program is simulated three times to measure hit rates, assuming three direct-mapped
caches with fixed line sizes of 32 bytes, 64 bytes, and 128 bytes. Then we regard the line size
that gives the highest hit rate of the three simulations as the suitable line size for the program.
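A minimal sketch of this selection step is shown below; the miss-rate values are placeholders chosen for illustration, and the helper function is our own, not part of the thesis tool chain.

#include <stdio.h>

/* Pick the line size whose prior simulation gave the lowest miss rate.
 * sizes[] and miss[] come from three runs of the cache simulator with
 * fixed 32-, 64-, and 128-byte lines.                                   */
static int best_line_size(const int sizes[3], const double miss[3])
{
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (miss[i] < miss[best])
            best = i;
    return sizes[best];
}

int main(void)
{
    int    sizes[3] = { 32, 64, 128 };
    double miss[3]  = { 0.052, 0.047, 0.061 };   /* illustrative values only */
    printf("selected S-VLS line size: %d bytes\n", best_line_size(sizes, miss));
    return 0;
}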
5.5 Dynamically Variable Line-Size Cache
5.5.1 Organization
The statically variable line-size cache (S-VLS cache) explained in Section 5.4 attempts to improve
cache-hit rates by exploiting differences in spatial locality among programs. The static method
may be adequate when target programs have regular access patterns within well-structured loops.
However, a number of programs have irregular access patterns, so the amount of spatial locality
may vary both within and among program executions. In contrast to the static method, the
dynamically variable line-size cache (D-VLS cache) selects adequate line sizes based on recently
observed data-reference behavior at run time. The cache adds a few hardware components to
optimize the line size.
Figure 5.6 illustrates the block diagram of a direct-mapped D-VLS cache having three line
sizes, 32 bytes, 64 bytes, and 128 bytes. The D-VLS cache has the following components for
optimizing the line size at run time:
• A reference-flag bit per subline : This flag bit is reset to 0 when the corresponding
subline is fetched into the cache, and is set to 1 when the subline is accessed by the processor.
Of course, the reference-flag bit corresponding to the subline which has caused the cache miss
is set to 1 when the cache replacement is performed. The reference-flag bit is used for determining
whether the corresponding subline is an adjacent-subline: on a cache lookup, if the tag of a subline
other than the reference-subline matches the tag field of the memory address, and if its reference-flag
bit is 1, then that subline is an adjacent-subline.
[Figure 5.6: Block Diagram of a Direct-Mapped D-VLS Cache — the S-VLS organization of Figure 5.5 extended with a reference-flag bit per subline, an LSS-table holding a Line-Size Specifier (LSS) per cache-sector, and a Line-Size Determiner (LSD) that reads the current line size and writes back the next line size.]
• A Line-Size Specifier (LSS) per cache-sector : This specifies the line size of the
corresponding cache-sector. As described in Section 5.3.2, each cache-sector is in one of three
states: the minimum, medium, and maximum line-size states. To identify these states, every
LSS holds 2 bits of state information. This means that the cache replacement is performed
according to the line size which is specified by the LSS corresponding to the reference-sector.
The LSS is stored in the LSS-table, as shown in Figure 5.6.
• Line-Size Determiner (LSD) : On every cache lookup, the LSD determines the state
of the line-size specifier of the reference-sector. The details of the determination algorithm are
explained in Section 5.5.3.
5.5.2 Operation
The D-VLS cache works as follows:
1. The memory address generated by the processor is divided into the byte offset within
a subline, subarray field designating the subarray, index field used for indexing the tag
memory, and tag field.
2. Each cache subarray has its own tag memory and comparator, and it can perform the
tag-memory lookup using the index and tag fields independently of the others. At
the same time, the LSS corresponding to the reference-sector is read from the LSS-table
using the index field.
3. One of the tag-comparison results is selected by the subarray field in the memory
address, and then the cache hit or miss is detected.
4. On a cache miss, a cache replacement is performed according to the state of the LSS.
5. Regardless of hits or misses, the LSD determines the state of the LSS. After that, the
LSD writes back the modified LSS to the LSS-table.
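To make steps 1-3 concrete, the following C sketch decomposes an address and performs the parallel tag lookup for the 16 KB D-VLS cache of Figure 5.6. The field widths (5-bit offset, 2-bit subarray field, 7-bit index, remaining bits of tag) follow from the 32-byte sublines, four subarrays, and 128 cache-sectors described in this chapter, but the structure and function names are ours, not the thesis hardware interface.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SUBARRAYS 4     /* four 32-byte SRAM subarrays per cache-sector */
#define NUM_SECTORS   128   /* 16 KB / 128-byte cache-sectors               */

struct dvls_fields {
    uint32_t offset;    /* bits [4:0]  : byte offset within a 32-byte subline   */
    uint32_t subarray;  /* bits [6:5]  : SA field selecting one of 4 subarrays  */
    uint32_t index;     /* bits [13:7] : cache-sector index                     */
    uint32_t tag;       /* bits [31:14]: tag                                    */
};

static struct dvls_fields split_address(uint32_t addr)
{
    struct dvls_fields f;
    f.offset   =  addr        & 0x1f;
    f.subarray = (addr >> 5)  & 0x3;
    f.index    = (addr >> 7)  & 0x7f;
    f.tag      =  addr >> 14;
    return f;
}

/* Illustrative tag memories: one per subarray, indexed by cache-sector. */
static uint32_t tag_mem[NUM_SUBARRAYS][NUM_SECTORS];
static bool     valid  [NUM_SUBARRAYS][NUM_SECTORS];

static bool dvls_lookup(uint32_t addr)
{
    struct dvls_fields f = split_address(addr);
    bool match[NUM_SUBARRAYS];

    /* Step 2: every subarray compares its own tag in parallel. */
    for (int sa = 0; sa < NUM_SUBARRAYS; sa++)
        match[sa] = valid[sa][f.index] && (tag_mem[sa][f.index] == f.tag);

    /* Step 3: the SA field selects which comparison decides hit or miss. */
    return match[f.subarray];
}

int main(void)
{
    uint32_t addr = 0x00012345;
    struct dvls_fields f = split_address(addr);
    tag_mem[f.subarray][f.index] = f.tag;     /* pretend the subline is cached */
    valid[f.subarray][f.index]   = true;
    printf("hit = %d\n", dvls_lookup(addr));  /* prints 1 */
    return 0;
}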
5.5.3 Line-Size Determination
The algorithm for determining adequate line sizes is very simple. This algorithm is based
not on memory-access history but on the current state of the reference-sector. This means
that no information about data evicted from the cache needs to be maintained. On every cache
lookup, the LSD determines the state of the LSS of the reference-sector, as follows:
1. The LSD investigates how many adjacent-sublines exist in the reference-sector using
all the reference-flag bits and the tag-comparison results.
2. Based on the above-mentioned investigation result and the current state of the LSS of
the reference-sector, the LSD determines the next state of the LSS. The state-transition
diagram is shown in Figure 5.7.
[Figure 5.7: State Transition Diagram — the three states Minimum Line, Medium Line, and Maximum Line; transitions are triggered by the pattern of reference-sublines and adjacent-sublines observed in the reference-sector, and all other patterns leave the state unchanged.]
If there are many neighboring adjacent-sublines, the reference-sector has rich spatial lo-
cality. This is because the data currently being accessed by the processor and the adjacent-
sublines are fetched from the same memory-sector, and these sublines have been accessed
by the processor recently. In this case, the line size should become larger. Thus the state
depicted in Figure 5.7 moves from the minimum state (32-byte line) to the medium state
(64-byte line) or from the medium (64-byte line) state to the maximum state (128-byte line)
when the reference-subline and adjacent-sublines construct a larger line size than the current
line size.
In contrast, if the reference-sector has been accessed sparsely before the current access,
there should be few adjacent-sublines in the reference-sector. This means that the reference-
sector has poor spatial locality at that time. In this case, the line size should become smaller.
So the state depicted in Figure 5.7 moves from the maximum state (128-byte line) to the
medium state (64-byte line) when the reference-subline and adjacent-sublines construct equal
or smaller line-size than the medium line-size (64-byte or 32-byte line). Similarly, the state
moves from the medium state (64-byte line) to the minimum state (32-byte line) when the
reference-subline and adjacent-sublines construct minimum line-size (32-byte line).
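The following C sketch captures this determination rule as we understand it from the description above and Figure 5.7: the LSD first computes which aligned group the reference-subline and the adjacent-sublines construct, and then moves the LSS one step toward that size. The encoding and helper names are ours; the thesis describes the LSD only at this behavioral level.

#include <stdbool.h>
#include <stdio.h>

enum lss_state { LINE_32 = 0, LINE_64 = 1, LINE_128 = 2 };  /* minimum/medium/maximum */

/* Size (in sublines: 1, 2, or 4) of the aligned group constructed by the
 * reference-subline and the adjacent-sublines. active[i] is true if
 * subline i is the reference-subline or an adjacent-subline.              */
static int constructed_sublines(const bool active[4], int ref)
{
    int half = ref & ~1;                               /* 64-byte-aligned partner pair */
    if (active[0] && active[1] && active[2] && active[3])
        return 4;                                      /* whole 128-byte sector        */
    if (active[half] && active[half + 1])
        return 2;                                      /* the reference-subline's half */
    return 1;
}

static enum lss_state next_lss(enum lss_state cur, const bool active[4], int ref)
{
    int built = constructed_sublines(active, ref);
    switch (cur) {
    case LINE_32:  return built  > 1 ? LINE_64  : LINE_32;    /* grow one step   */
    case LINE_64:  return built  > 2 ? LINE_128 :
                          built == 1 ? LINE_32  : LINE_64;    /* grow or shrink  */
    case LINE_128: return built <= 2 ? LINE_64  : LINE_128;   /* shrink one step */
    }
    return cur;
}

int main(void)
{
    bool active[4] = { true, true, false, false };   /* sublines 0 and 1 in use */
    enum lss_state s = next_lss(LINE_32, active, 0);
    printf("next state: %d (0=32B, 1=64B, 2=128B)\n", s);   /* prints 1 */
    return 0;
}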
5.6 Evaluations
In this section, we discuss the performance/energy efficiency of the VLS caches, S-VLS and D-
VLS. Before presenting the performance/energy improvements achieved by the VLS caches,
we consider the access time and access energy of the cache and main memory, respectively.
Then we show simulation results for cache-hit rates and cache-line sizes, and evaluate the
performance in term of the average memory-access time (AMAT ) and the energy in term of
the average memory-access energy (AMAE).
5.6.1 Simulation Environment
In this evaluation, we compare the VLS caches with some conventional caches. Each cache
model is represented as follows:
• Fix128 : Conventional 16 KB direct-mapped cache with fixed 128-byte line size.
• Fix128W2 : Conventional 16 KB two-way set-associative cache with fixed 128-byte line
size.
• Fix128W4 : Conventional 16 KB four-way set-associative cache with fixed 128-byte line
size.
• Fix128db : Conventional 32 KB direct-mapped cache with fixed 128-byte line size.
• SVLS128-32 : 16 KB direct-mapped S-VLS cache having three line sizes of 32 bytes,
64 bytes, and 128 bytes. The cache changes the line size program by program. The
adequate line size of each program is determined based on prior simulations.
• DVLS128-32 : 16 KB direct-mapped D-VLS cache having three line sizes of 32 bytes,
64 bytes, and 128 bytes. The line-size determiner optimizes the line size at run time.
For the cache-access time (TCache), we use the CACTI 2.0 model in Section 5.6.2. CACTI
estimates the cache-access time through a detailed analysis of several components, for example,
sense amplifiers, output drivers, and so on [96] [42]. In addition, we calculate the cache-access
energy (ECache) based on Kamble's model [48]. Then, we measure cache-miss rates using two
kinds of cache simulators written in C: one for conventional caches with a fixed 128-byte line size
and the other for the VLS caches with 32-byte, 64-byte, and 128-byte line sizes. The line size
Table 5.1: Benchmark Programs.
Programs Inputs
SPECint92 026.compress, 072.sc ref
SPECfp92 052.alvinn ref
SPECint95 099.go, 124.m88ksim, 126.gcc, 130.li,
132.ijpeg, 134.perl, 147.vortex training
SPECfp95 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d test
MPEG2 encoder, decoder verification
Mix-Int1 124.m88ksim, 130.li, 147.vortex –
Mix-Int2 072.sc, 126.gcc, 134.perl –
Mix-Fp 052.alvinn, 101.tomcatv, 103.su2cor –
Mix-IntFp 132.ijpeg, 099.go, 104.hydro2d –
(i.e., the number of sublines involved in cache replacements) is also measured for the D-VLS
cache. In our experiments, eleven integer programs and five floating-point programs from the
SPEC92/95 benchmark suite [82] are used. We also simulate the mpeg2encode and mpeg2decode
programs from [63] using verification pictures as media applications. Furthermore, to model more
realistic execution on general-purpose processors, four benchmark sets are used: Mix-Int1,
Mix-Int2, Mix-Fp, and Mix-IntFp. The programs in each benchmark set are assumed to run in a
multiprogrammed manner on a uniprocessor system, and a context switch occurs after every one
million instructions. Mix-Int1 and Mix-Int2 contain integer programs only, and Mix-Fp consists of
three floating-point programs. Mix-IntFp is formed by two integer programs and one floating-point
program. For each benchmark set, three billion instructions are executed.
All of the programs are compiled by GNU CC with the “–O2” option, and are executed on
an Ultra SPARC architecture. The address traces are captured by QPT [31].
5.6.2 Cache-Access Time
Cache-access time, or cache-hit time, is very sensitive to the cache organization. Figure
5.8 illustrates critical timing paths of the conventional caches and the S-VLS cache.
[Figure 5.8: Cache Critical Path — (a) a conventional direct-mapped cache with 128-byte lines, (b) a conventional 4-way set-associative cache with 32-byte lines, and (c) a direct-mapped statically variable line-size cache with 32-, 64-, and 128-byte lines; each panel shows the TagSide-path, DataSide-path, and (where present) MuxSide-path through the decoder, tag and data reads, comparators, multiplexor drivers, and output drivers.]
MatchOut
and DataOut are outputs of the caches, both of which are driven by tri-state buffers. We
assume that the multiplexors to select a word data are implemented by the tri-state buffers.
The cache-access time consists of the delay for decoder, tag read, data read, comparators,
multiplexor drivers, and output drivers [96]. The cache-access time of the conventional direct-
mapped cache is determined by either the TagSide-path or the DataSide-path, while that of
the conventional set-associative cache is determined by the longer path of the MuxSide-path
and the DataSide-path, as shown in Figure 5.8 (a) and (b).
The structure of the S-VLS cache is similar to that of the conventional set-associative cache
having 32-byte small line size, as shown in Figure 5.8 (b) and (c). In the conventional set-
associative cache, the MuxSide-path often determines the cache-access time because the control
signals for selecting a word of data are generated only after the tag comparison is performed.
However, this critical path does not appear in the S-VLS cache because the control signals for the data
selection are made from the reference address directly. In the D-VLS cache, reference-flag,
LSS-table, and LSD for run-time line-size optimization are added to the S-VLS cache orga-
nization. As these components are not on the critical-path of the S-VLS cache, the D-VLS
cache also does not have extra overhead for the cache-access time.
Larger cache lines have two effects on the cache-access time. First, the delay for decoder
is reduced by the decreased number of cache lines in the SRAM array. Second, the delay
Table 5.2: Cache Access Time.
Cache Access Time [s] Normalized Access Time [Tunit]
Fix128 1.12129e-09 1.000
Fix128W2 1.64826e-09 1.470
Fix128W4 2.11147e-09 1.883
Fix128db 1.34006e-09 1.195
SVLS128-32 1.12129e-09 1.000
DVLS128-32 1.12129e-09 1.000
for data drivers becomes longer because the number of drivers which share an output line is
increased and there is more loading at the output of each driver [96]. These features appear
not only on conventional caches but also on the VLS caches. Therefore, we can assume
that the DataSide-path delay of the VLS caches is the same as that of Fix128 which is the
conventional cache with the same cache size and the same associativity. On the other hand,
the TagSide-path of the VLS caches might be slightly longer than that of Fix128 because one
of the four tag-comparison results has to be chosen for the MatchOut signal. However, control signals
for this selection are made from the reference address directly. Thus, the TagSide-path of
the VLS caches is longer than that of Fix128 by only the delay for a single tri-state buffer.
We consider that the latency of a single tri-state buffer has hardly any adverse effect on the
cache-access time. Consequently, it is assumed that the cache-access time of the VLS
caches is the same as that of Fix128.
Table 5.2 shows the cache-access time based on the CACTI model [42]. It is assumed
that the process technology is 0.18 um. Here, we regard the cache-access time of the 16 KB
conventional direct-mapped cache (Fix128) as Tunit. The dynamic line-size optimization
in the D-VLS cache requires two LSS-table accesses per cache access: one for reading and one for
writing. From the cache-cycle-time point of view, the two accesses might make the cache-cycle
time longer, because if the LSS-table is implemented as an SRAM array, it is very hard to complete
the two SRAM accesses in a single processor clock cycle. There are two methods to resolve this
problem: one is to pipeline the LSS-table accesses, and the other is to implement
the LSS-table using flip-flops. The latter method is employed in this evaluation, because
the former method makes the structure and control for implementing the LSS-table more
complex.
5.6.3 Cache-Access Energy
To obtain the values of the capacitances C in Equation (2.7), we refer to [49], which follows
the model given by [95]. We formulate the energy dissipated in a conventional M-way set-
associative cache, in which the total number of sets (the total number of cache-sectors) and
the tag size are denoted by Nrow and T bits, respectively. The cache has an St-bit status
flag for each subline. The cache-access energy (ECache) can be approximated by the sum of
the energy dissipated in the bit lines (Ebit) and that in the word lines (Eword) [48]:
ECache ≈ ESRAMarray ≈ Ebit + Eword,
Ebit = 0.5 × Vdd^2 × [Nbl,prch × Cbl,prch + Nbl,w × Cbl,w + Nbl,r × Cbl,r
       + m × (8 × L + T + St) × (Cg,qpa + Cg,qpb + Cg,qp)],
Eword = Vdd^2 × m × (8 × L + T + St) × (2 × Cg,q1 + Cwordwire),
Cbl,prch = Nrow × (0.5 × Cd,q1 + Cbitwire),
Cbl,w = Cbl,r = Nrow × (0.5 × Cd,q1 + Cbitwire) + Cd,qp + Cd,qpa,
Nbl,prch = 0.5 × (T × M + St + 8 × L × M) × 2,
Nbl,r = 0.5 × (T × M + St + 8 × L × M) × 2,
Nbl,w = 0.5 × WPA × (St + Wavg,data) × 2.
Here, we assume a 3.3-volt power supply and a 1/2 Vdd voltage swing on the bit lines.
WPA (writes per access) denotes the number of write operations per cache access, and is assumed
to be 0.3. Wavg,data is the average data width of a write request, and is assumed to be 19 bits.
We also assume that all signal values are independent and have a uniform switching probability
of 0.5. Nbl,prch, Nbl,w, and Nbl,r are the total numbers of bit-line transitions, and Cbl,prch,
Cbl,w, and Cbl,r are the bit-line load capacitances, due to precharging, writing, and reading,
respectively. Cg,X and Cd,X are the gate and drain capacitances of a transistor X, respectively.
The transistors qp, qpa, and qpb are used for the bit-line precharging circuits, and q1
is the pass gate of an SRAM cell. Cbitwire is the bit-line wire capacitance, and Cwordwire is the
word-line wire capacitance, per SRAM cell. Based on [49], we use the following capacitance values:
Cd,q1 = 2.737 fF; Cg,q1 = 0.401 fF; Cbitwire = 4.4 fF/bit-cell; Cd,qp = Cd,qpa = Cd,qpb = 80.89 fF;
Cg,qp = Cg,qpa = Cg,qpb = 38.08 fF; Cwordwire = 1.8 fF/bit-cell. These values are based on the
0.8 micron CMOS cache design described in [95].
Table 5.3: Cache Access Energy.
Cache Access Energy [fJ] Normalized Access Energy [Eunit]
Fix128 10,013,100 1.000
Fix128W2 11,611,419 1.160
Fix128W4 14,818,110 1.480
Fix128db 18,406,193 1.838
SVLS128-32 10,529,301 1.051
DVLS128-32 10,916,142 1.090
Table 5.3 shows the cache-access energy (ECache) for each cache. The energy consumed
for write operations for cache refill is ignored. We regard the cache-access energy of the 16
KB conventional direct-mapped cache (Fix128) as Eunit. Increasing the cache associativity
consumes more energy, because it increases the total number of bit-lines, precharging circuits,
and so on. Similarly, increasing the cache size consumes more energy due to the increase in
the bit-line capacitance (i.e., increase in Nrow). Thus, the cache-access energy of Fix128W4
and Fix128db are larger than that of Fix128. On the other hand, the VLS caches do not
have this kind of energy overhead, because the cache size and associativity of Fix128 are
maintained. The VLS caches do consume slightly more energy due to the extra tag comparisons;
they perform a tag comparison at every subarray, as explained in Section 5.4.2 and in Section
5.5.2. However, the total number of bit-lines to be activated for the tag-memory accesses is
much smaller than that for the data-memory accesses. Therefore, the energy overhead for
the extra tag comparison is small. In addition, although the D-VLS cache needs to read
the 2-bit LSS and four 1-bit reference flags for run-time line-size optimization, this energy
overhead is also trivial.
Table 5.4: Cache-Miss Rates.
Program Fix128 Fix128W2 Fix128W4 Fix128db SVLS128-32 DVLS128-32
026.compress 0.1871 0.1755 0.1732 0.1634 0.1718 0.1724
072.sc 0.037 0.0285 0.0263 0.0276 0.0364 0.0465
052.alvinn 0.0224 0.0087 0.0080 0.0175 0.0224 0.0181
099.go 0.1024 0.0695 0.0302 0.0541 0.0571 0.0638
124.m88ksim 0.0202 0.0045 0.0028 0.0068 0.0167 0.0153
126.gcc 0.0611 0.0344 0.0254 0.0349 0.0535 0.0526
130.li 0.0341 0.0203 0.0182 0.0226 0.0341 0.0358
132.ijpeg 0.0244 0.0048 0.0036 0.0068 0.0195 0.0175
134.perl 0.0542 0.0230 0.0105 0.0295 0.0332 0.0286
147.vortex 0.050 0.0292 0.0195 0.030 0.036 0.0374
101.tomcatv 0.0633 0.0182 0.0062 0.0546 0.0633 0.0578
102.swim 0.2612 0.3007 0.3137 0.1016 0.1381 0.1419
103.su2cor 0.2600 0.0840 0.0242 0.2396 0.0887 0.0758
104.hydro2d 0.0481 0.0217 0.0179 0.0259 0.0481 0.0295
mpeg2encoder 0.0840 0.0033 0.0007 0.0326 0.0468 0.0476
mpeg2decoder 0.0265 0.0045 0.0036 0.0131 0.0105 0.0197
Mix-Int1 0.0348 0.0187 0.0145 0.0211 0.0278 0.0285
Mix-Int2 0.0515 0.0269 0.0192 0.0309 0.0384 0.0414
Mix-Fp 0.1119 0.0370 0.0132 0.1005 0.0468 0.0385
Mix-IntFp 0.0597 0.0327 0.0188 0.0311 0.0452 0.0377
5.6.4 Cache-Miss Rate
We have measured cache-miss rates for all benchmark programs using event-driven cache-
simulators. Table 5.4 shows the simulation results. For some programs, the VLS caches achieve
nearly the same or even lower miss rates than the double-size conventional direct-mapped cache
(Fix128db). However, increasing the associativity produces much better results.
For all programs except for 026.compress and 102.swim, the conventional four-way set-
associative cache (Fix128W4) achieves the lowest cache-miss rates of all the caches.
[Figure 5.9: Miss Rates for Benchmarks — miss rates of Fix32, Fix64, Fix128, and DVLS128-32 for the integer and floating-point programs, normalized per benchmark to the conventional cache with the best line size; bars that exceed the plotted range are annotated with their values.]
To evaluate the accuracy of dynamic line-size optimization of the D-VLS cache, we have
executed the SPEC benchmark programs and MPEG2 programs on 16 KB conventional
direct-mapped caches, each of which has 32-byte lines (Fix32), 64-byte lines (Fix64), and
128-byte lines (Fix128). Figure 5.9 presents simulation results. The left three bars for
each benchmark are cache-miss rates produced by the conventional caches. The remaining
bar to the right is the result of the D-VLS cache (DVLS128-32). For each benchmark,
simulation results are normalized to the cache-miss rate produced by the conventional cache
with the best line size. It is clear that the best line size is highly application-dependent.
In a number of programs, however, the D-VLS cache gives nearly equal or lower miss rates
than the conventional cache with the best line size. In particular, for 132.ijpeg, 134.perl,
052.alvinn, and 104.hydro2d, the D-VLS cache has significant performance advantages over
the conventional caches. For all other programs but one (072.sc), the D-VLS cache produces
better results than the conventional cache with the second-best line size.
[Figure 5.10: Amount of Spatial Locality at a Cache-Sector — for 072.sc, 134.perl, and 104.hydro2d, the number of referenced 32-byte sublines (1 to 4) in the most frequently accessed 128-byte line, plotted over the cache-replacement sequence.]
Although the D-VLS cache gives good results for almost all programs, it does not work
well for 072.sc. To clarify the cause, we have analyzed the transition of the amount of spatial
locality at the cache-sector that is most frequently accessed by the processor on Fix128. In this
analysis, we have measured the number of 32-byte sublines referenced by the processor in a
128-byte fixed line while the 128-byte line resides in the
cache. We regard the number of the referenced 32-byte sublines as the amount of spatial
locality at the cache-sector. Figure 5.10 presents the simulation results; the horizontal axis
shows cache-replacement sequence, and the vertical axis shows the number of the referenced
32-byte sublines in the 128-byte fixed line. It is clear that the amount of spatial locality
in 134.perl and 104.hydro2d is stable, whereas that in 072.sc varies frequently. On every
cache lookup, the line-size determiner (LSD) tries to detect the amount of spatial locality at
the reference-sector based on the number of adjacent-sublines. When the amount of spatial
locality of each cache-sector varies frequently, as in 072.sc, the LSD lacks the accuracy needed to
determine the adequate line size.
5.6.5 Main-Memory-Access Time and Energy
The main-memory-access time (TMainMemory) and energy (EMainMemory) depend on the mem-
ory size, organization, process technology, and so on. In this evaluation, we assume that the
main-memory-access time, including the delay for data transfer between the cache and the
main memory (i.e., TDRAMarray + LineSize / BandWidth), is ten times longer than the access
time of the 16 KB direct-mapped conventional cache having 128-byte lines (i.e., TMainMemory =
10 × Tunit).
For the main-memory-access energy, we assume that there is no energy dissipation for
DRAM refresh operations in order to simplify the evaluation. Thus, for the on-chip memory-
path architectures with a conventional cache, the main-memory-access energy (EMainMemory)
depends only on the total number of main-memory accesses. In other words, only cache-miss
rates affect the energy consumption. Since the VLS caches activate only the DRAM subarrays
corresponding to replaced sublines, the energy consumed for accessing to the on-chip main
memory depends not only on cache-miss rates but also on cache-line sizes (i.e., the number of
sublines to be involved in cache replacements). Accordingly, the main-memory-access energy
(EMainMemory) in Equation (2.4) can be expressed as follows:
EMainMemory = (EDRAMarray + EDataTransfer) × AverageLineSize / 128 bytes. (5.1)
Here, we assume that the average main-memory-access energy of conventional caches is ten
times larger than the cache-access energy of Fix128 (i.e., EDRAMarray + EDataTransfer = 10 × Eunit).
The right factor (AverageLineSize / 128 bytes) in Equation (5.1) denotes the average fraction of
the memory-sector's 32-byte DRAM subarrays activated per cache-line replacement.
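For illustration, under these assumptions a program whose average line size is 34.69 bytes (026.compress in Table 5.5) activates only 34.69 / 128 ≈ 0.27 of a memory-sector per replacement, so each main-memory access consumes roughly 0.27 × 10 × Eunit ≈ 2.7 Eunit, whereas a program with a 90.22-byte average line size (052.alvinn) consumes about 7.0 Eunit.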
Table 5.5 shows the average line size on the S-VLS cache (SVLS128-32) and the D-VLS
cache (DVLS128-32). The table also reports the breakdown of cache-replace count for line
Table 5.5: Average Line Size and Replace Count on VLS caches.
Program, S-VLS Ave. Line Size [B], D-VLS Replace Count (32 B, 64 B, 128 B lines), D-VLS Ave. Line Size [B]
026.compress 32.00 3,164,502 243,979 14,498 34.69
072.sc 128.00 1,038,520 492,007 352,181 58.32
052.alvinn 128.00 11,546,415 1,465,880 18,806,730 90.22
099.go 32.00 6,445,160 1,724,674 389,746 42.82
124.m88ksim 64.00 317,746 53,858 68,353 50.83
126.gcc 64.00 10,092,540 3,463,487 1,468,861 48.76
130.li 128.00 1,190,072 426,488 189,570 49.63
132.ijpeg 64.00 3,530,649 1,179,064 1,246,695 58.43
134.perl 32.00 7,987,886 5,250,134 3,849,457 63.46
147.vortex 32.00 19,805,372 3,593,130 1,416,595 42.11
101.tomcatv 128.00 23,539,313 2,608,352 2,650,269 43.73
102.swim 32.00 32,465,163 4,163,613 884,142 37.81
103.su2cor 32.00 15,340,954 6,701,837 3,315,895 53.01
104.hydro2d 128.00 3,784,227 860,802 6,175,600 89.34
mpeg2encoder 32.00 1,764,231 79,783 10,182 33.90
mpeg2decoder 32.00 30,770 2,968 2,047 40.12
Mix-Int1 81.15 14,705,908 4,618,322 2,022,767 60.24
Mix-Int2 74.12 18,492,295 8,250,953 4,620,521 48.02
Mix-Fp 55.44 21,632,285 8,565,636 8,541,197 54.56
Mix-IntFp 82.60 17,005,515 4,564,846 7,577,526 61.97
sizes in the D-VLS cache. The average line size of the D-VLS cache depends on the char-
acteristics of memory-reference behavior in programs. It is observed that the D-VLS cache
attempts to use the small line size for 026.compress, and the average line size is 34.69 bytes.
In contrast, the cache aggressively chooses the large line size for 052.alvinn in order to exploit
the rich spatial locality, and the average line size is 90.22 bytes.
[Figure 5.11: Energy Consumed for On-Chip DRAM Accesses (CMR × 2 × EMainMemory) — normalized energy consumption of Fix128, Fix128W2, Fix128W4, Fix128db, SVLS128-32, and DVLS128-32 for each benchmark program and benchmark set.]
Figure 5.11 depicts the energy consumption for accessing the on-chip main memory.
All results are normalized to the conventional direct-mapped cache having 128-byte lines
(Fix128). As explained earlier, the energy consumption of the conventional caches depends only
on the cache-miss rates. Therefore, the conventional four-way set-associative cache (Fix128W4)
can achieve large energy reductions. For some programs, 052.alvinn, 130.li, 101.tomcatv, and
104.hydro2d, the S-VLS cache cannot reduce any energy relative to Fix128, because the appropriate
line size is 128 bytes. Although the cache-miss rates of the VLS caches are higher than those
of Fix128W4, the VLS caches gain a significant energy advantage from the selective
activation of the on-chip DRAM subarrays. Actually, for a number of programs, the energy
reduction achieved by the VLS caches is comparable to that achieved by Fix128W4.
5.6.6 Average Memory-Access Time
We have calculated the average memory-access time (AMAT) as the performance metric, based on
the cache-access time explained in Section 5.6.2, the cache-miss rates reported in Section
5.6.4, and the main-memory-access time defined in Section 5.6.5. Figure 5.12 depicts the
average memory-access time for each program in terms of Tunit, which is the access time of
Fix128. The upper dark-gray box of each bar is the delay for the cache replacement, which
is formulated by CMR × 2 × TMainMemory.
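As a concrete illustration, combining Table 5.4 with TMainMemory = 10 × Tunit, the AMAT of Fix128 for 103.su2cor is approximately 1.000 + 0.2600 × 2 × 10 ≈ 6.2 Tunit, whereas that of DVLS128-32 is approximately 1.000 + 0.0758 × 2 × 10 ≈ 2.5 Tunit.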
First, we compare the conventional caches. Increasing the cache associativity (Fix128W2,
Fix128W4) yields significant improvements in the cache-miss rates, as reported in Section
5.6.4. However, the improvement is negated by the longer cache-access time. As a result,
the conventional set-associative caches cannot improve the average memory-access time
for many programs. On the other hand, the double-size conventional direct-mapped cache
(Fix128db) achieves higher performance than Fix128 for many programs because of its small
access-time overhead.
Next, we discuss the performance improvements achieved by the VLS caches (SVLS128-32
and DVLS128-32). The VLS caches have no cache-access-time overhead, so the cache-miss-rate
improvement obtained by optimizing the line size translates directly into the average memory-access
time. The S-VLS cache changes the line size program by program. Thus, the performance
of SVLS128-32 is the same as that of Fix128 for 052.alvinn, 072.sc, 130.li, 101.tomcatv,
and 104.hydro2d, whose appropriate line size is 128 bytes. For the other programs, the
S-VLS cache improves the performance over Fix128. This can be understood easily by considering
how the appropriate line size is determined: it is chosen based on prior simulations assuming three
direct-mapped caches with fixed 32-byte, 64-byte, and 128-byte lines, so at worst the cache-miss
rate of Fix128 is guaranteed. On the other hand, when the dynamic line-size optimization in the
D-VLS cache lacks accuracy, as for 072.sc explained in Section 5.6.4, the cache worsens the performance.
However, most of the programs see the performance improvements from the dynamic line-
size optimization, with the exception of 072.sc. The performance improvements achieved by
the VLS caches (SVLS128-32 and DVLS128-32) are comparable to that achieved by the
double-size conventional direct-mapped cache (Fix128db).
[Figure 5.12: Average Memory-Access Time (AMAT) — AMAT in units of Tunit for the six cache models on each benchmark; the lower portion of each bar is TCache and the upper portion is the cache-replacement delay CMR × 2 × TMainMemory, with bars exceeding the plotted range annotated with their values.]
5.6.7 Average Memory-Access Energy
We have measured the average memory-access energy (AMAE) based on the cache-access
energy explained in Section 5.6.3, cache-miss rates reported in Section 5.6.4, and the main-
memory-access energy evaluated in Section 5.6.5. Figure 5.13 depicts the average memory-access
energy for the benchmark programs in terms of Eunit, which is the access energy of Fix128.
[Figure 5.13: Average Memory-Access Energy (AMAE) — AMAE in units of Eunit for the six cache models on each benchmark; the lower portion of each bar is ECache and the upper portion is CMR × 2 × EMainMemory, with bars exceeding the plotted range annotated with their values.]
The upper dark-gray box of each bar is the energy consumed for the cache replacement, which
is formulated by CMR × 2 × EMainMemory.
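As a concrete illustration, combining Tables 5.3, 5.4, and 5.5 with EDRAMarray + EDataTransfer = 10 × Eunit, the AMAE of Fix128 for 103.su2cor is approximately 1.000 + 0.2600 × 2 × 10 ≈ 6.2 Eunit, whereas that of DVLS128-32 is approximately 1.090 + 0.0758 × 2 × 10 × (53.01 / 128) ≈ 1.7 Eunit, a reduction of more than 70 %.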
As explained in Section 5.6.5, the conventional four-way set-associative cache (Fix128W4)
can greatly reduce the energy consumed for main-memory accesses because it achieves much lower
cache-miss rates. However, Fix128W4 consumes more total energy than the 16 KB conven-
tional direct-mapped cache (Fix128) for some programs, because the cache-access-energy overhead
of the increased associativity exceeds the reduction in main-memory-access energy achieved by
improving the cache-miss rates. Similarly, since the double-size conventional direct-mapped cache
(Fix128db) consumes much more energy per access, it is not efficient.
[Figure 5.14: Total Energy Dissipated in Memory Systems (AMAE × total number of memory references) — total energy for the six cache models on the benchmark sets Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp, expressed in units of Eunit and, on the right axis, in Joules assuming Eunit = 10.0 nJ.]
On the other hand, the energy overhead for changing the line size in the VLS caches
(SVLS128-32 and DVLS128-32) is trivial. Therefore, the cache-miss improvement and the
DRAM-subarray subbanking based on small line sizes in the VLS caches contribute to the
total energy reduction. In the best case of 103.su2cor, the VLS caches achieve more than 70
% reduction of the average memory-access energy. Figure 5.14 depicts total energy dissipated
in memory systems, i.e., average memory-access energy × total number of memory references,
for each benchmark set.
5.6.8 Energy–Delay Product
To evaluate the performance and the energy at the same time, we have calculated the energy-
delay products (AMAE × AMAT ) based on Section 5.6.6 and Section 5.6.7. Figure 5.15
shows the results. For each program, all results are normalized to Fix128.
In the conventional caches, the performance improvement achieved by increasing the cache size
(Fix128db) is negated by the increased energy consumption. Conversely, the energy improvement
produced by increasing the cache associativity (Fix128W2 and Fix128W4) is negated by the
performance loss caused by the longer cache-access time. The VLS caches do not suffer from this
kind of trade-off, because they produce both performance and energy improvements.
For the benchmark set Mix-IntFp, the highest performance of conventional caches is given
by the double-size direct-mapped cache (Fix128db), whereas the two-way set-associative
cache (Fix128W2) is the most efficient for energy consumption. However, the ED product
reductions achieved by Fix128db and Fix128W2 are only from 8 % to 20 %, compared with
the 16 KB conventional direct-mapped cache (Fix128). On the other hand, the S-VLS cache
(SVLS128-32) and the D-VLS cache (DVLS128-32) can reduce the ED product by 35 %
and 47 %, respectively. For most of the benchmarks, the S-VLS cache or the D-VLS cache
can make the most significant ED product reduction by optimizing the line size, with the
exception of 101.tomcatv and mpeg2encode.
For many programs, the D-VLS cache gives better results than the S-VLS cache. The reason
can be understood by considering the frequency of line-size modification. Since the amount of
spatial locality varies both within and among program executions, the appropriate line size is not
constant. The S-VLS cache optimizes the line size program by program, while the D-VLS cache
modifies it data by data. Therefore, the D-VLS cache can adapt to the changes in spatial locality
inherent in programs.
[Figure 5.15: Energy-Delay Product — ED product (AMAE × AMAT) of the six cache models for each benchmark, normalized to Fix128.]
5.6.9 Hardware Cost
Generally, a cache consists of an SRAM portion (data-array and tag-array) and logic
portions (decoder, comparator, and multiplexors). Additionally, the D-VLS cache requires
special hardware components: the reference-flag bits, the LSS-table, and the LSD. We have
calculated the size of the SRAM portion and have designed the logic portions in order to find
the number of transistors for each cache. In this design, we have described the logic portions
Table 5.6: Hardware Costs.
Cache Model, SRAM portion (Data [bits], Tag [bits], Total [bits]), Logic portion (Logic [Tr], LSD [Tr], LSS [Tr], Total [Tr]), Total [Tr]
Fix128 131,072 2,304 133,376 17,988 – – 17,988 84,676
Fix32W4 131,072 10,240 141,312 18,968 – – 18,968 89,624
SVLS128-32 131,072 9,216 140,288 18,922 – – 18,922 89,066
DVLS128-32 131,072 9,728 140,800 18,922 230 14,020 33,172 103,572
at the RT level using VHDL (VHSIC Hardware Description Language), and have translated it
into a gate-level description using the Synopsys VHDL Compiler.
For the D-VLS cache (DVLS128-32), each tag includes the 1-bit reference-flag. The LSS-
table is implemented by flip-flops in order to keep the cache-cycle time, as explained earlier
in Section 5.6.2. Since the 16 KB D-VLS cache with 32-byte, 64-byte, and 128-byte lines has
128 cache-sectors (= 16KB / 128bytes), DVLS128-32 requires 256 (= 2bits×128) flip-flops
for the LSS-table. We can implement the LSD with small combinational logic due to the
simple algorithm for determining the adequate line sizes. Table 5.6 shows the size of the
SRAM portion and the number of transistors for the logic portions. The right-most column
gives the total number of transistors including the SRAM portion, where 2 bits of SRAM are
counted as one transistor. This assumption comes from a roadmap [79], which shows that the
ratio of logic transistors/cm2 to cache SRAM bits/cm2 from 2001 to 2007 is
approximately 1:2. The column labeled “LSS” includes both the flip-flops for the LSS-table
and multiplexors for selecting the LSS corresponding to the reference-sector.
The construction of the direct-mapped S-VLS cache (SVLS128-32) is similar to that of a
conventional four-way set-associative cache having 32-byte lines, with the exception of the
tag size. The tag size is affected by the cache associativity, but not by the cache line size.
It is observed that the hardware overheads of the S-VLS cache and the D-VLS cache over
Fix128 are 5 % and 22 %, respectively. Although the D-VLS cache requires more transistors
than the conventional caches, this hardware overhead is trivial for the area of the entire chip
of merged DRAM/logic LSIs which have not only the on-chip cache but also a large on-chip
main memory.
[Figure 5.16: The Effect of Cache Size — (A) miss rates of Fix32, Fix128, and DVLS128-32 as the cache size varies from 4 KB to 128 KB; (B) breakdown of the miss rates into compulsory, capacity, and conflict misses for 16 KB and 128 KB caches.]
5.6.10 Effects of Other Parameters
It is important to analyze the proposed cache architecture under various conditions. In this
section, we evaluate the effectiveness of the dynamically variable line-size (D-VLS) cache in
detail: the effects of the cache size, the on-chip main-memory-access time and energy, and the
size of the LSS-table. The 16 KB direct-mapped D-VLS cache having 32-byte, 64-byte, and 128-byte
lines is compared with the three conventional caches having 128-byte fixed line size: 16 KB
direct-mapped cache (Fix128), 16 KB four-way set-associative cache (Fix128W4), and 32 KB
direct-mapped cache (Fix128db). The four benchmark sets, Mix-Int1, Mix-Int2, Mix-Fp, and
Mix-IntFp, are used in this analysis.
5.6.10.1 Cache Size
In order to investigate the effect of cache size on the D-VLS cache performance, we have
simulated the conventional caches and the D-VLS cache varying the cache sizes from 4 KB
to 128 KB. Figure 5.16 (A) presents the average cache-miss rates of the four benchmark sets.
DVLS128-32 is superior to the conventional caches over the whole range of cache sizes from
4 KB to 128 KB. When the cache size exceeds 64 KB, however, the D-VLS cache does
not make a significant improvement.
Figure 5.16 (B) shows the breakdown of the cache-miss rates for Mix-IntFp benchmark set.
From the figure, it is clear that increasing the cache size reduces the conflict misses even if
the fixed large line size is employed. When the cache size is very small, the total number
of large lines in the cache is very small. In this case, the negative effect of frequent evictions
caused by large lines exceeds the positive effect of prefetching. In contrast, increasing cache
size increases the total number of large lines in the cache. As a result, the conflict misses
can be reduced even if programs do not have enough spatial locality. The D-VLS cache
attempts to improve the performance by reducing the conflict misses. When the cache has
enough capacity for the working set of the programs, the conventional cache can already avoid
the frequent evictions. Therefore, the effectiveness of the D-VLS cache is degraded as the cache
size increases.
Although on-chip cache sizes have certainly been increasing, the working sets of target
application programs have also been growing. Hence, we believe that the D-VLS cache will
produce large performance improvements even as the cache size increases.
5.6.10.2 On-Chip Main-Memory-Access Time and Energy
In merged DRAM/logic LSIs, the on-chip main memory will occupy a large area of the
whole chip. The main-memory-access time (TMainMemory) and energy (EMainMemory) depend
on the on-chip DRAM size, process technology, and so on. Therefore, it is very important to
consider the effect of the on-chip main-memory performance and energy on the total memory-
system performance and energy. To evaluate the robustness of the D-VLS cache, we have
simulated the conventional caches and the D-VLS cache under various conditions.
Figure 5.17 shows the average memory-access time (AMAT ) when the main-memory-
access time (TMainMemory) is changed from 2Tunit to 22Tunit, where Tunit is the cache-
access time of Fix128. The conventional cache with higher associativity (Fix128W4) or
larger size (Fix128db) produces lower cache-miss rates than the D-VLS cache, as reported
in Section 5.6.4. Therefore, the conventional caches achieve higher performance than the D-VLS
cache when the main-memory-access time is large, because increasing the associativity or the
cache size reduces the miss-penalty component (CMR × 2 × TMainMemory) through the improved
cache-miss rate by more than it increases the cache-access time (TCache). Nevertheless, the
performance efficiency of the D-VLS cache is still comparable to that of the conventional caches
even if the main-memory-access time is as large as 22 Tunit.
[Figure 5.17: The Effect of Main-Memory-Access Time — AMAT of Fix128, Fix128W4, Fix128db, and DVLS128-32 for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as TMainMemory varies from 2 Tunit to 22 Tunit; each bar is split into TCache and CMR × 2 × TMainMemory.]
Figure 5.18 depicts the average memory-access energy (AMAE) when the main-memory-
access energy (EMainMemory) is changed from 2Eunit to 22Eunit, where Eunit is the cache-
access energy of Fix128. When the main-memory-access energy exceeds 10 Eunit, the set-
associative cache (Fix128W4) is superior to the other conventional caches because of its lowest
cache-miss rates. The cache-miss improvement reduces the total number of main-memory accesses,
so the total energy is reduced, and this trend becomes clearer as the main-memory-access energy
increases. The cache-miss rates of the D-VLS cache (DVLS128-32)
are higher than those of the set-associative cache (Fix128W4). However, the D-VLS cache has
two ways to reduce the main-memory-access energy: one is to improve the cache-miss rates, and
the other is to obtain the DRAM subbanking effect based on the optimized line size.
[Figure 5.18: The Effect of Main-Memory-Access Energy — AMAE of Fix128, Fix128W4, Fix128db, and DVLS128-32 for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as EMainMemory varies from 2 Eunit to 22 Eunit; each bar is split into ECache and CMR × 2 × EMainMemory.]
Thus, the D-VLS cache can make the most significant energy reduction for all benchmark
sets even if the main-memory-access energy is increased.
5.6.10.3 LSS-Table Size
Thus far, we have assumed that each cache-sector in the 16 KB D-VLS cache has its own
line-size specifier (LSS). Namely, the LSS-table has the same number of entries as the total
number of cache-sectors in the cache. To evaluate the accuracy of the cache-sector-based run-time
line-size optimization, we compare it with a memory-sector-based run-time optimization that
ignores the hardware cost. In addition, we simulate the benchmark sets with various granularities
of line-size specification in order to find the effect of the LSS-table size on the D-VLS cache
performance. Sharing an LSS among many cache-sectors reduces the hardware cost of the D-VLS
cache, as reported in Section 5.6.9.
[Figure 5.19: The Effect of the LSS-Table Size — cache-miss rates of the D-VLS cache for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp as the number of cache-sectors sharing an LSS varies from 1 (128 LSS-table entries) to 128 (1 entry), with an additional ideal configuration providing an LSS per memory-sector.]
Figure 5.19 depicts the cache-miss rates for the benchmark sets. The horizontal axis shows the total number of cache-sectors sharing an LSS. For example, “8(16)” means that eight cache-sectors share an LSS, so that the total number of entries in the LSS-table is sixteen (the total number of cache-sectors in the 16 KB D-VLS cache is 128 = 16 × 1024 / 128). The right-most plot, denoted as “memory-sector”, means that the D-VLS cache has an LSS for each memory-sector rather than for each cache-sector. This is an ideal D-VLS cache that ignores the hardware cost of implementing the LSS-table.
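As an illustration of the parameter varied in Figure 5.19, the sketch below shows one plausible way to map an address to an entry of a shared LSS-table; the grouping of consecutive cache-sector indices is an assumption made for this example, not a detail taken from the hardware design.

# A minimal sketch (an assumption for illustration, not the thesis hardware) of
# indexing a shared LSS-table: 16 KB cache, 128-byte cache-sectors -> 128 sectors.

SECTOR_SIZE = 128                          # bytes per cache-sector (maximum line size)
NUM_SECTORS = 16 * 1024 // SECTOR_SIZE     # 128 cache-sectors in the 16 KB D-VLS cache

def lss_index(address, sectors_per_lss):
    """LSS-table entry used for `address` when `sectors_per_lss` sectors share one LSS."""
    sector_index = (address // SECTOR_SIZE) % NUM_SECTORS   # which cache-sector is indexed
    return sector_index // sectors_per_lss                  # which LSS entry it falls into

# The "8(16)" configuration: eight sectors share an LSS, so the table has 16 entries.
print(lss_index(0x1A40, 8))   # cache-sector 52 -> LSS entry 6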
First, we compare the cache-sector-based realistic D-VLS cache denoted as “1(128)” with
the memory-sector-based ideal D-VLS cache denoted as “memory-sector” in the figure. In all
but one benchmark set (Mix-Int2), the difference between the improvements given by the realistic model and the ideal model is small. This means that the line-size determiner can select adequate line sizes even if it does not accurately track the amount of spatial locality of
individual memory-sectors.
Next, we discuss the effect of reducing the LSS-table size on the D-VLS cache performance.
The cache-miss rates of the 16 KB conventional cache having fixed 128-byte lines (Fix128) for Mix-Int1, Mix-Int2, Mix-Fp, and Mix-IntFp are 0.0348, 0.0515, 0.1119, and 0.0597, respectively, as reported in Section 5.6.4. Although decreasing the LSS-table size (i.e., increasing the number of cache-sectors sharing an LSS) tends to increase the cache-miss rate, the D-VLS cache can still achieve better results than Fix128. For Mix-Int1 and Mix-Int2, sharing an LSS among a few cache-sectors even improves the cache-miss rates. For example, the cache-miss rates given by “4(32)”, in which four cache-sectors share an LSS, are lower than those given by the completely cache-sector-based configuration denoted as “1(128)”. This can be understood by considering the behavior of the line-size determiner (LSD) explained in Section 5.5.3. The LSD updates the state of the LSS corresponding to the reference-sector, which is the cache-sector accessed by the processor. Therefore, an LSS is updated more frequently when it is shared by several cache-sectors. As a result, the LSS may converge more rapidly to an appropriate line size.
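The direction of these updates can be pictured with the toy line-size specifier below: it grows the line size when recent references indicate rich spatial locality and shrinks it otherwise. The threshold and the reference-count heuristic are deliberate simplifications for illustration; the actual determiner of Section 5.5.3 operates on the reference flags of the accessed cache-sector.

# A toy line-size specifier (LSS) update rule, simplified for illustration:
# grow the line size under rich spatial locality, shrink it otherwise.
# The threshold below is an assumption, not the algorithm of Section 5.5.3.

LINE_SIZES = (32, 64, 128)        # selectable line sizes in bytes

def update_lss(current_size, neighbour_sublines_referenced):
    """Return the next line size for this sector.

    `neighbour_sublines_referenced` counts how many of the other 32-byte
    sublines of the 128-byte sector were touched while the line was resident.
    """
    i = LINE_SIZES.index(current_size)
    if neighbour_sublines_referenced >= 2:       # rich spatial locality -> prefetch more
        i = min(i + 1, len(LINE_SIZES) - 1)
    elif neighbour_sublines_referenced == 0:     # no spatial locality -> avoid conflicts
        i = max(i - 1, 0)
    return LINE_SIZES[i]

print(update_lss(64, 3))   # -> 128
print(update_lss(64, 0))   # -> 32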
5.7 Related Work
Saulsbury et al.[78] and Wilson et al.[94] discussed cache architectures having large cache-line
size (512 bytes) with high on-chip memory bandwidth. They tried to avoid frequent cache
conflicts caused by the large cache lines by increasing the cache associativity. Since the D-VLS cache resolves the conflict problem by means of a variable cache-line size, the fast access time of a direct-mapped cache can be maintained.
Several studies have proposed coherent caches in order to improve the performance of shared-memory multiprocessor systems [18], [20]. The cache proposed in [20]
can adjust the amount of data stored in a cache line, and aims to produce fewer invalidations
of shared data and reduce bus or network transactions. On the other hand, the VLS cache
aims at improving the system performance of merged DRAM/logic LSIs by partitioning a large cache line into multiple small cache sublines that can be handled independently, and by adjusting the number of sublines involved in cache replacements. The fixed and adaptive sequential prefetching proposed in [18] allows more than one consecutive cache line to be fetched. This approach needs a counter indicating the number of lines to be fetched. Regardless of the memory reference addresses, the counter is always used for fetching cache lines on read misses. On the other hand, the D-VLS cache has several flags indicating the cache-line size, and which flag is used depends on the memory reference address. In other words, the D-VLS cache can change the cache-line size not only as program execution advances but also across data located at different memory addresses.
Excellent cache architectures exploiting spatial locality have been proposed in [23], [57]
and [41]. The caches presented in [41] and [57] need tables for recording the memory access history of not only cached data but also data evicted from the cache. Similarly, the cache presented in [23] uses a table for recording the behavior of past load/store operations. In addition, the detection of spatial locality in [23] relies on the memory access behavior derived from constant-stride vector accesses. On the other hand, the D-VLS cache determines a suitable cache-line size based only on the state of the cache line that is currently being accessed by the processor. Consequently, the D-VLS cache needs no large tables for storing the memory access history; just a single bit is added to each cache-tag for this purpose.
Furthermore, the above studies have focused only on performance. Our VLS cache attempts to achieve not only high performance but also low power consumption by making good use
of the high on-chip memory bandwidth available on merged DRAM/logic LSIs.
5.8 Conclusions
In this chapter, we have described the variable line-size cache (VLS cache), which is a novel
cache architecture suitable for merged DRAM/logic LSIs. The purpose of the VLS cache is to
make good use of the attainable high on-chip memory bandwidth. The VLS cache attempts
to alleviate the negative effects of large cache line size by changing the cache line size. As the
line-size modification does not require any access-time overhead, the VLS cache can improve the memory performance. Moreover, activating only the DRAM subarrays corresponding to the replaced line size yields a significant energy reduction.
We have proposed two VLS caches: the statically variable line-size cache (S-VLS cache)
and the dynamically variable line-size cache (D-VLS cache). The S-VLS cache determines
an appropriate line size based on prior simulations. The D-VLS cache tries to optimize the
cache line size using a hardware assist. The line-size determiner detects the varying amount
of spatial locality within and among programs based on recently observed data reference
behavior at run time.
To evaluate the performance/energy efficiency of the VLS caches, we have simulated many
benchmark programs on the VLS caches and on conventional caches. The results show that the VLS caches achieve a significant performance/energy improvement. In addition, we have designed the caches to evaluate the hardware overhead. For the Mix-IntFp benchmark set, which includes two integer programs and one floating-point program, the S-VLS cache and the D-VLS cache reduce the energy-delay product by 35 % and 47 %, respectively, while their hardware overheads are only 5 % and 22 %, compared with a conventional cache having the same cache size and associativity.
The D-VLS cache is the more promising of the two. Since the D-VLS cache does not require any modification of the instruction set architecture, full compatibility with existing object code can be kept. In addition, the cache adapts to the varying amount of spatial locality within and among programs. Therefore, we have analyzed the D-VLS cache in detail: the effect of the cache size, the on-chip main-memory-access time and energy, and the size of the LSS-table.
Employing merged DRAM/logic LSIs is one of the most important approaches for future
computer systems, because it can achieve high-performance/low-power by eliminating the
chip boundaries between processors and main memory. It is possible to obtain more perfor-
mance/energy improvements by exploiting the attainable high on-chip memory bandwidth
effectively. Since the VLS cache is applicable to any merged DRAM/logic LSIs, we believe
that the cache management using variable line-size is a very useful approach to improving
the performance/energy efficiency.
Chapter 6
Conclusions
The mobile market is likely to continue growing in the future. One uncompromising requirement of portable computing is energy efficiency, because it directly affects the battery life. At the same time, portable computing will target more demanding applications, for example moving pictures, so that higher performance is also required.
Cache memories have been employed as one of the most important components of com-
puter systems, because they confine memory accesses on-chip. Reducing the frequency of
off-chip memory accesses produces significant advantages: reducing memory-access latency
and reducing I/O driving energy. In order to achieve higher performance, designers have
invested the increasing transistor budget in the cache memories (increasing cache capacity).
However, increasing the cache capacity also increases the cache-access time and energy. Since memory references exhibit temporal and spatial locality, memory accesses concentrate on the cache memory. Therefore, the performance/energy efficiency of cache memories strongly affects the total system performance and energy dissipation. This fact suggests that we need to keep developing high-performance, low-energy cache memories.
In this thesis, we have proposed the following three cache architectures for high performance
and low energy dissipation.
• Way-predicting set-associative cache: A history table in the cache records MRU infor-
mation of each set. When a cache access is issued, only the MRU way is activated.
If the way prediction is correct, there is no activation in the remaining ways, thereby
saving the energy. Namely, the way-predicting set-associative cache attempts to elimi-
nate unnecessary way activation in set-associative caches. It has been observed in our evaluation that a way-predicting set-associative cache reduces the cache-access energy (ECache in Equation (2.4)) by more than 70 %, while it incurs less than 10 % of cache-access-time overhead (TCache in Equation (2.1)), compared with a conventional set-associative cache. A minimal sketch of this access flow is given after this list.
• History-based tag-comparison cache: A history table implemented in a BTB (Branch
Target Buffer) records execution footprints of each instruction block. The corresponding
footprint is left at the first execution of the instruction block. When the instruction
block is executed again, the corresponding footprint is tested. If the footprint is found, the tag comparisons for cache accesses within the instruction block can be omitted. The execution footprints remain valid until a cache miss takes place. Namely, the history-based tag-comparison cache attempts to eliminate unnecessary tag comparisons in order to reduce energy dissipation. It has been observed in our evaluation that a history-based tag-comparison cache reduces the tag-comparison energy by more than 99 % for a program (107.mgrid), compared with a conventional cache.
• Dynamically variable line-size cache: A history table implemented as reference flags
records recently observed memory-access patterns. The dynamically variable line-size
cache adjusts the cache-line size according to the amount of spatial locality at run-
time. If rich spatial locality is observed, the cache increases the cache-line size in order
to obtain the effect of prefetching. Otherwise, the cache decreases the cache-line size to avoid conflict misses. Namely, the dynamically variable line-size cache attempts to eliminate unnecessary data replacement and bandwidth utilization. It has been observed in our evaluation that a dynamically variable line-size cache improves the energy-delay product (AMAT × AMAE, i.e., Equation (2.1) × Equation (2.4)) by more than 45 % for a benchmark set (Mix-IntFp), compared with a conventional organization.
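As referenced in the first item above, the following is a behavioural sketch of the way-predicting access flow. It is an illustration only: the cycle and energy costs are symbolic counters, and the MRU-update policy is the straightforward one implied by the description, not a transcription of the actual design.

# A behavioural sketch of the way-predicting set-associative access flow
# described in the first bullet above (illustration only; costs are symbolic).

class WayPredictingCache:
    def __init__(self, num_sets, num_ways):
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.mru = [0] * num_sets           # way-prediction table: MRU way per set
        self.cycles = 0                     # accumulated access cycles
        self.ways_activated = 0             # proxy for cache-access energy

    def access(self, set_index, tag):
        predicted = self.mru[set_index]
        self.cycles += 1
        self.ways_activated += 1            # first, only the predicted way is searched
        if self.tags[set_index][predicted] == tag:
            return "predicted hit"          # one cycle, one way activated
        # Misprediction: spend one more cycle and search the remaining ways.
        self.cycles += 1
        self.ways_activated += len(self.tags[set_index]) - 1
        for way, t in enumerate(self.tags[set_index]):
            if way != predicted and t == tag:
                self.mru[set_index] = way   # remember the new MRU way
                return "non-predicted hit"
        return "miss"                       # replacement and MRU update would follow

cache = WayPredictingCache(num_sets=4, num_ways=4)
cache.tags[1][2] = 0xABC
print(cache.access(1, 0xABC))   # non-predicted hit (2 cycles, 4 ways activated)
print(cache.access(1, 0xABC))   # predicted hit     (1 cycle, 1 way activated)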
Our caches attempt to improve performance/energy efficiency by eliminating unnecessary
operations at run-time. Dynamic measurement makes it possible to adapt the caches to the
characteristics of programs. Although we have discussed the cache architectures individually, it is also possible to combine them: for example, a way-predicting set-associative cache with a set-associative dynamically variable line-size cache, or a history-based tag-comparison cache with a direct-mapped dynamically variable line-size cache. Therefore, we conclude that our cache architectures are promising for improving the
performance/energy efficiency of memory systems in future processor systems.
We believe that more space in future processor chips will be invested in the cache memories (not only level-1 but also level-2, level-3, and so on). Thus, the cache memories will remain an important component of processor chips. The following are our future challenges.
• The most effective approach to reducing energy dissipation is to reduce the supply
voltage (Vdd in Equation (2.7)). However, a low supply voltage brings with it leakage power that is dissipated across the whole cache memory [51]. Reducing this leakage power consumption is an attractive challenge.
• We believe that the behavioral approaches to improving performance/energy efficiency explained in Section 2.5.1.2 are the more promising. However, clever cache control complicates logic verification. Mature verification techniques for cache memories, for example formal verification techniques, are therefore very important.
• Increasing the cache area may have the undesirable effect of reducing manufacturing yield. Although adding redundancy circuits (and memory cells) improves the yield, it also degrades performance (i.e., the cache-access time becomes longer). The cache-access time directly affects the memory-access latency, as shown in
Equation (2.1). Thus, fault-tolerant techniques suitable for high-speed cache memories
are very important.
• In a future society built on world-wide network systems, one of the most serious problems is the security of information, for example credit-card numbers, phone numbers, and other personal data. This kind of information will be stored and handled by memory systems (disk, main memory, cache memory, and so on). Accordingly, another challenge is to develop high-security memory systems for the society of the twenty-first century.
Acknowledgment
I would like to express my sincere appreciation to my advisor, Professor Hiroto Yasuura, for
his insight, advice, and support during my studies. My future career will benefit greatly from
his guidance.
I wish to acknowledge valuable discussions with Professor Kazuaki Murakami. His stern
evaluation of my work enhanced the quality of my research. I would like to express my
gratitude to Professor Toshinori Sueyoshi. I learned the attitude as a researcher under his
guidance. I would like to thank Professor Itsujiro Arita for giving me an opportunity to work
in computer science. I also would like to thank Professor Mizuho Iwaihara and Mr. Sunao
Sawada for valuable discussions in our laboratory seminars.
I am very grateful to Professor Makoto Amamiya and Professor Kazuaki Murakami for
serving as members of the committee for this thesis and providing thoughtful suggestions.
I would like to acknowledge numerous, past and present, people at Kyushu Institute of
Technology for supporting me. In particular, I would like to thank Professor Morihiro Kuga,
Mr. Koichiro Tanaka, Mr. Hidetomo Shibamura, Dr. Masaru Okumura, Mr. Masahide
Ouchi, and Mr. Munehiro Iida for sharing so much technical knowledge. I would like to
thank my past and present colleagues of Kyushu University, Dr. Hiroyuki Tomiyama, Mr.
Hiroshi Miyajima, Mr. Kenjiro Ike, Dr. Kei Hirose, Dr. Tohru Ishihara, Dr. Akihiko Inoue,
Mr. Eko Fajar Nurprasetyo, Mr. Makoto Sugihara, Mr. Koji Hashimoto, Mr. Katsuhiko
Metsugi, Mr. Takanori Okuma, Ms. Yun Cao, and other members of laboratory, for giving
me helpful suggestions. I also thank Ms. Kazumi Matsuoka, Ms. Kaori Kuga, Ms. Noriko
Usuki, Ms. Kyoko Matsuda, Ms. Kyoko Kubota, Ms. Rika Shudo, and Ms. Naoko Taketomi
for supporting my activities.
I am grateful to Dr. Seiki Ogura, Dr. Yutaka Hayashi, Mr. Seitoku Ogura, Ms. Tomoko
Ogura, and Ms. Betty Bhudhikanok of Halo LSI Design & Device Technology, Inc. for giving
me so much knowledge for circuit and layout design. In particular, I would like to express
my sincere appreciation to Dr. Seiki Ogura. He gave me an opportunity to work for a new company at the forefront of VLSI technology. I have learned a great deal from that experience. I also would like to thank Ms. Ichie Ogura for helping with my life in
the USA. I am also grateful to Mr. Makoto Kojima of Matsushita Electric Industrial Corp.
for giving me so much worthwhile knowledge for circuit design.
I would like to thank the past and present members of the ISIT/KYUSHU (Institute of
Systems & Information Technologies / KYUSHU). Special thanks are due to Dr. Hiroshi
Data, Mr. Koji Kai, and Mr. Hideaki Fujikake for their cooperation.
I would like to thank my uncle, Takayuki Kurihara, and his wife, Hiroko Kurihara for
supporting my life. I also would like to thank my parents, Haruo Inoue and Chizuko Inoue,
for many years of love, care, and support.
Last, but not least, thanks to my wife Tomomi and our two children, Sakura and Gaku,
for encouraging me.
Bibliography
[1] Agarwal, A., and Pudar, S. D., “Column-associative caches: A technique for reducing
the miss rate of direct-mapped caches, ” In Proc. of the 20th International Symposium
on Computer Architecture, pp. 179–180, May 1993.
[2] Agarwal, A., Hennessy, J., and Horowitz, M., “Cache performance of operating systems
and multiprogramming, ” In ACM Transactions on Computer Systems, volume 6, pp.
393–431, Nov. 1988.
[3] Albonesi, D. H., “Selective cache ways: On-demand cache resource allocation, ” In Proc.
of the International Symposium on Microarchitecture, pp. 248–259, Nov. 1999.
[4] Bahar, R. I., Albera, G., and Manne, S., “Power and performance tradeoffs using
various caching strategies, ” In Proc. of the 1998 International Symposium on Low
Power Electronics and Design, pp. 64–69, Aug. 1998.
[5] Bajwa, R. S., Hiraki, M., Kojima, H., Gorny, D. J., Nitta, K., Shridhar, A., Seki, K., and
Sasaki, K., “Instruction buffering to reduce power in processors for signal processing, ”
In IEEE Transaction on Very Large Scale Integration Systems, volume 5, pp. 417–424,
Dec. 1997.
[6] Bellas, N., Hajj, I., and Polychronopoulos, C., “Using dynamic cache management
techniques to reduce energy in a high-performance processor, ” In Proc. of the 1999
International Symposium on Low Power Electronics and Design, pp. 64–69, Aug. 1999.
[7] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G., “Architectural and com-
piler support for energy reduction in the memory hierarchy of high performance micro-
processors, ” In Proc. of the 1998 International Symposium on Low Power Electronics
and Design, pp. 70–75, Aug. 1998.
[8] Bellas, N., Hajj, I., Polychronopoulos, C., and Stamoulis, G., “Energy and performance
improvements in microprocessor design using a loop cache, ” In Proc. of the International
Conference on Computer Design: VLSI in Computers & Processors, pp. 378–383, Oct.
1999.
[9] Benini, L., De Micheli, G., Macii, E., Sciuto, D., and Silvano, C., “Asymptotic zero-
transition activity encoding for address busses in low-power microprocessor-based sys-
tems, ” In Proc. of the 7th Great Lakes Symposium on VLSI, pp. 77–82, Mar. 1997.
[10] Benschneider, B. J., Park, S., Allmon, R., Anderson, W., Arneborn, M., Cho, J., Choi, C.,
Clouser, J., Han, S., Hokinson, R., Hwang, G., Jung, D., Kim, J., Krause, J., Kwack, J.,
Meier, S., Seok, Y., Thierauf, S., and Zhou, C., “A 1ghz alpha microprocessor, ” In
Proc. of the 2000 International Solid-State Circuits Conference, pp. 86–87, Feb. 2000.
[11] Burger, D. C., Austin, T. M., and Bennett, S., “Evaluating future microprocessors - the simplescalar toolset.”
[12] Burger, D., Goodman, J. R., and Kagi, A., “Memory bandwidth limitations of future
microprocessors, ” In Proc. of the 23rd Annual International Symposium on Computer
Architecture, pp. 78–89, May 1996.
[13] Burger, D., Kaxiras, S., and Goodman, J. R., “Datascalar architectures, ” In Proc. of
the 23rd Annual International Symposium on Computer Architecture, June 1997.
[14] Calder, B., Grunwald, D., and Emer, J., “Predictive sequential associative cache, ” In
Proc. of the 2nd International Symposium on High-Performance Computer Architecture,
pp. 244–253, Feb. 1996.
[15] Caravella, J. S., “A low voltage sram for embedded applications, ” In IEEE Journal of
Solid-State Circuits, volume 32, pp. 428–432, Mar. 1997.
[16] Chang, J. H., Chao, H., and So, K., “Cache design of a sub-micron cmos system/370, ”
In Proc. of the 14th International Symposium on Computer Architecture, pp. 208–213,
June 1987.
[17] Chiou, D., Jain, P., Rudolph, L., and Devadas, S., “Application-specific memory man-
agement for embedded systems using software-controlled caches, ” In Proc. of 37th
Design Automation Conference, pp. 416–419, June 2000.
[18] Dahlgren, F., Dubois, M, and Stenstrom, P., “Fixed and adaptive sequential prefetching
in shared memory multiprocessors, ” In Proc. of the 1993 International Conference on
Parallel Processing, pp. 56–63, Aug. 1993.
[19] Delaluz, V., Kandemir, M., Vijaykrishnan, N., and Irwin, M. J., “Energy-oriented com-
piler optimizations for partitioned memory architectures, ” In Proc. of the International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 138–
147, Nov. 2000.
[20] Dubnicki, C., and LeBlanc, T. J., “Adjustable block size coherent caches, ” In Proc. of
the 19th Annual International Symposium on Computer Architecture, pp. 170–180, May
1992.
[21] Fisk, B. R., and Bahar, R. I., “The non-critical buffer: Using load latency tolerance to
improve data cache efficiency, ” In Proc. of the International Conference on Computer
Design: VLSI in Computers & Processors, pp. 538–545, Oct. 1999.
[22] Ghose, K., and Kamble, M. B., “Reducing power in superscalar processor caches using
subbanking, multiple line buffers and bit-line segmentation, ” In Proc. of the 1999
International Symposium on Low Power Electronics and Design, pp. 70–75, Aug. 1999.
[23] Gonzalez, A., Aliagas, C., and Valero, M., “A data cache with multiple caching strate-
gies tuned to different types of locality, ” In Proc. of the International Conference on
Supercomputing, pp. 338–347, July 1995.
[24] Green, P. K., “A ghz ia-32 architecture microprocessor implemented on 0.18um tech-
nology with aluminum interconnect, ” In Proc. of the 2000 International Solid-State
Circuits Conference, pp. 98–99, Feb. 2000.
[25] Haji, N. B. I., Polychronopoulos, C., and Stamoulis, G., “Architectural and compiler
support for energy reduction in the memory hierarchy of high performance micropro-
cessors, ” In Proc. of the 1998 International Symposium on Low Power Electronics and
Design, pp. 70–75, Aug. 1998.
[26] Hasegawa, A., et al., “Sh3: High code density, low power, ” In IEEE Micro, pp. 11–19,
Dec 1995.
[27] Hashimoto, K., Tomita, H., Inoue, K., Metsugi, K., Murakami, K., Miyakawa, N., In-
abata, S., Yamada, S., Takashima, H., Kitamura, K., Obara, S., Amisaki, T., Tanabe, K.,
Nagashima, U., and Hayakawa, K., “Moe: A special-purpose parallel computer for high-
speed, large scale molecular orbital calculation, ” In SuperComputing (SC99), Nov.
1999.
[28] Hennessy, J. L., and Patterson, D. A., “Computer architecture: A quantitative ap-
proach, ” In Morgan Kaufmann Publishers, Inc, 1990.
[29] Hicks, P., Walnock, M., Owens, R. M., “Analysis of power consumption in memory
hierarchies, ” In Proc. of the 1997 International Symposium on Low Power Electronics
and Design, pp. 239–242, Aug. 1997.
[30] Hill, M. D., “A case for direct-mapped caches, ” In IEEE Computer, volume 21, pp.
25–40, Dec. 1988.
[31] Hill, M. D., Larus, J. R., Lebeck, A. R., Talluri, M., and Wood, D. A., “Warts: Wisconsin
architectural research tool set, ” In http://www.cs.wisc.edu/larus/warts.html.
[32] Hofstee, P., Aoki, N., Boerstler, D., Coulman, P., Dhong, S., Flachs, B., Kojima, N.,
Kwon, O., Lee, K., Meltzer, D., Kowka, K., Park, J., Peter, J., Posluszny, S., Shapiro, M.,
Silberman, J., Takahashi, O., and Weinberger, B., “A 1ghz single-issue 64b powerpc
processor, ” In Proc. of the 2000 International Solid-State Circuits Conference, pp.
92–93, Feb. 2000.
[33] Hwu, W. W., and Chang, P. P., “Achieving high instruction cache performance with
an optimizing compiler, ” In Proc.of the 16th Annual International Symposium on
Microarchitecture, pp. 242–251, May 1989.
[34] Inoue, K., and Murakami, K., “Tag comparison omitting for low-power instruction
caches (in Japanese), ” In IPSJ Technical Report, volume ARC140-6, pp. 25–30, Nov.
2000.
[35] Inoue, K., Ishihara, T., and Murakami, K., “Way-predicting set-associative cache for
high performance and low energy consumption, ” In Proc. of the 1999 International
Symposium on Low Power Design, pp. 273–275, Aug. 1999.
[36] Inoue, K., Kai, K., and Murakami, K., “High bandwidth, variable line-size cache architecture for merged dram/logic lsis, ” In IEICE Transactions on Electronics, volume E81-C, pp. 1438–1447, Sep. 1998.
[37] Inoue, K., Kai, K., and Murakami, K, “Dynamically variable line-size cache exploiting
high on-chip memory bandwidth of merged dram/logic lsis, ” In Proc. of the 5th In-
ternational Symposium on High-Performance Computer Architecture, pp. 218–222, Jan.
1999.
[38] Ishihara, T., and Yasuura, H., “A power reduction technique with object code merging
for application specific embedded processors, ” In Proc. of Design, Automation and Test
in Europe Conference 2000, pp. 617–623, Mar. 2000.
[39] John, L. K., and Subramanian, A., “Design and performance evaluation of a cache assist
to implement selective caching, ” In Proc. of the International Conference on Computer
Design: VLSI in Computers & Processors, pp. 510–518, Oct. 1997.
[40] Johnson, T., L., and Hwu, W. W., “Run-time adaptive cache hierarchy management via
reference analysis, ” In Proc. of the 19th Annual International Symposium on Computer
Architecture, pp. 315–326, June 1997.
[41] Johnson, T. L., Merten, M. C, and Hwu, W. W., “Run-time spatial locality detection
and optimization, ” In Proc. of the 30th Annual International Symposium on Microar-
chitecture, pp. 57–64, Dec. 1997.
[42] Jouppi, N. P., “Cacti home page, ” In
http://www.research.digital.com/wrl/people/jouppi/CACTI.html.
[43] Jouppi, N. P., “Improving direct-mapped cache performance by the addition of a small
fully-associative cache and prefetch buffers, ” In Proc. of the 17th Annual International
Symposium on Computer Architecture, pp. 364–373, June 1990.
[44] Jouppi, N. P., Boyle, P., Dion, J., Doherty, M. J., Eustace, A., Haddad, R. W., Mayo, R.,
Menon, S., Monier, L. M., Stark, D., Turrini, S., Yang, J. L., Hamburgen, W. R.,
Fitch, J. S., and Kao, R., “A 300-mhz 115-w 32-b bipolar ecl microprocessor, ” In
IEEE Journal of Solid-State Circuits, volume 28, pp. 1152–1166, Nov. 1993.
[45] Juan, T., Lang, T., and Navarro, J. J., “The difference-bit cache, ” In Proc. of the 23rd
Annual International Symposium on Computer Architecture, pp. 114–119, May 1996.
[46] Kaeli, D. R., and Emma, P. G., “Branch history table prediction of moving target
branches due to subroutine returns, ” In Proc. of the 18th Annual International Sym-
posium on Computer Architecture, pp. 34–42, May 1991.
[47] Kalamatianos, J., and Kaeli, D. R., “Temporal-based procedure reordering for improved
instruction cache performance, ” In Proc. of the 4th International Symposium on High-
Performance Computer Architecture, pp. 244–253, Jan./Feb. 1998.
[48] Kamble, M. B. and Ghose, K., “Analytical energy dissipation models for low power
caches, ” In Proc. of the 1997 International Symposium on Low Power Electronics and
Design, pp. 143–148, Aug. 1997.
[49] Kamble, M. B. and Ghose, K., “Energy-efficiency of vlsi caches: A comparative study, ”
In Proc. of the 10th International Conference on VLSI Design, pp. 261–267, Jan. 1997.
[50] Kawabe, N., and Usami, K., “Low power technique for on-chip memory using biased
partitioning and access concentration (in Japanese), ” In IPSJ DA Symposium ’00, pp.
191–196, July 2000.
[51] Kaxiras, S., Hu, Z., Narlikar, G., and McLellan, R, “Cache-line decay: A mechanism
to reduce cache leakage power, ” In Proc. of Workshop on Power-Aware Computer
Systems, Nov. 2000.
[52] Kessler, R. E, Jooss, R., Lebeck, A., and Hill, M. D, “Inexpensive implementations of set-
associativity, ” In Proc. of the 16th International Symposium on Computer Architecture,
pp. 131–139, 1989.
[53] Kim, H. S., Vijaykrishnan, N., Kandemir, M., and Irwin, M. J., “Multiple access caches:
Energy implications, ” In Proc. of the IEEE CS Annual Workshop on VLSI, Apr. 2000.
[54] Kin, J., Gupta, M., and Mangione-Smith, W. H., “The filter cache: An energy ef-
ficient memory stucture, ” In Proc. of the 30th Annual International Symposium on
Microarchitecture, pp. 184–193, Dec. 1997.
[55] Kirihata, T., Mueller, G., Ji, B., Frankowsky, G., Ross, J., Terletzki, H., Netis, D., Wein-
furtner, O., Hanson, D., Daniel, G., Hsu, L., Storaska, D., Reith, A., Hug, M., Guay, K.,
Selz, M., Poechmueller, P., Hoenigschmid, H., and Wordeman, M., “A 390mm2 16 bank
1gb ddr sdram with hybrid bitline architecture, ” In Proc. of the 1999 International
Solid-State Circuits Conference, pp. 422–423, Feb. 1999.
[56] Ko, U., Balsara, P. T., and Nanda, A. K., “Energy optimization of multi-level processor
cache architecture, ” In Proc. of the 1995 International Symposium on Low Power
Design, pp. 45–49, Apr. 1995.
[57] Kumar, S. and Wilkerson, C., “Exploiting spatial locality in data caches using spa-
tial footprints, ” In Proc. of the 25th Annual International Symposium on Computer
Architecture, pp. 357–368, June 1998.
[58] Lebeck, A. R., Fan, X., Zeng, H., and Ellis, C., “Power aware page allocation, ” In Proc.
of the 9th International Conference on Architectural Support for Programming Language
and Operating Systems, pp. 105–116, Nov. 2000.
[59] Lee, H. S., and Tyson, G. S., “Region-based caching: an energy-delay efficient memory
architecture for embedded processors, ” In Proc. of the International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems, pp. 120–127, Nov. 2000.
[60] Liu, L., “Cache design with partial address matching, ” In Proc. of the 27th Annual
International Symposium on Microarchitecture, pp. 128–136, Nov./Dec. 1994.
[61] McFarling, S., “Cache replacement with dynamic exclusion, ” In Proc. of the 19th
Annual International Symposium on Computer Architecture, pp. 191–200, May 1992.
[62] Milutinovic, V., Markovic, B., Tomasevic, M., and Tremblay, M., “The split tempo-
ral/spatial cache: A complexity analysis, ” In Proc. of SCIzzL-5, Mar. 1996.
[63] MPEG Software Simulation Group, “Free mpeg software: mpeg-2 encoder/decoder,
version 1.2, ” In http://www.mpeg.org/ tristan/MPEG/MSSG/, 1996.
[64] Murakami, K., Shirakawa, S., and Miyajima, H., “Parallel processing ram chip with
256mb dram and quad processors, ” In Proc. of the 1997 International Solid-State
Circuits Conference, pp. 228–229, Feb. 1997.
[65] Nakamura, H., Kondo, M., and Boku, T., “Software controlled reconfigurable on-chip
memory for high performance computing, ” In Proc. of the 2nd Workshop on Intelligent
Memory Systems, Nov. 2000.
[66] Nii, K., Makino, H., Tujihashi, Y., Morishima, C., Hayakawa, Y., Ninogami, H.,
Arakawa, T., and Hamano, H., “A low power sram using auto-backgate-controlled
mt-cmos, ” In Proc. of the 1998 International Symposium on Low Power Design, pp.
293–298, Aug. 1998.
[67] Ohsawa, T., Kai, K., and Murakami, K, “Optimizing the dram refresh count for merged
dram/logic lsis, ” In Proc. of the 1998 International Symposium on Low Power Design,
pp. 82–87, Aug. 1998.
[68] Panda, P. R., Dutt, N. D., and Nicolau, A., “Memory organization for improved data
cache performance in embedded processors, ” In Proc. of the International Symposium
on System Synthesis, pp. 90–95, Nov. 1996.
[69] Panda, P. R., Dutt, N. D., and Nicolau, A., “Efficient utilization of scratch-pad memory
in embedded processor applications, ” In Proc. of European Design & Test Conference,
Mar. 1997.
[70] Panwar, R., and Rennels, D., “Reducing the frequency of tag compares for low power i-
cache design, ” In Proc. of the 1995 International Symposium on Low Power Electronics
and Design, pp. 57–62, Apr. 1995.
[71] Park, G-H., “Design and analysis of an adaptive memory system for deep-submicron and processor-memory integration technologies, ” PhD thesis, Department of Computer Science, The Graduate School, Yonsei University, Nov. 1999.
[72] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.,
Thomas, R., and Yelick, K., “A case for intelligent ram, ” In IEEE Micro, volume 17,
pp. 34–44, Mar./Apr. 1997.
[73] Patterson, D., Anderson, T., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.,
Thomas, R., and Yelick, K., “Intelligent ram(iram) chips that remember and com-
pute, ” In Proc. of the 1997 International Solid-State Circuits Conference, pp. 224–225,
Feb. 1997.
[74] Peir, J. K., Lee, Y., and Hsu, W. W., “Capturing dynamic memory reference behavior
with adaptive cache topology, ” In Proc. of the 8th International Conference on Archi-
tectural Support for Programming Language and Operating Systems, pp. 240–250, Oct.
1998.
[75] Sakurai, T., et al., “Low-power high-speed lsi circuits & technology (in Japanese), ” In
Realize, Inc, 1998.
[76] Sanchez, F. J, Gonzalez, A., and Valero, M., “Static locality analysis for cache manage-
ment, ” In Proc. of the International Conference on Parallel Architectures and Compi-
lation Techniques, Nov. 1997.
[77] Santhanam, S., “Strongarm sa-110: a 160mhz 32b 0.5w cmos arm processor, ” In Hot
Chips 8: A Symposium on High-Performance Chips, Aug. 1996.
[78] Saulsbury, A., Pong, F., and Nowatzyk, A., “Missing the memory wall: The case for
processor/memory integration, ” In Proc. of the 23rd Annual International Symposium
on Computer Architecture, pp. 90–101, May 1996.
[79] Semiconductor Industry Association, “The national technology roadmap for semicon-
ductors, ” 1994.
[80] Seznec, A., “A case for two-way skewed-associative caches, ” In Proc. of the 20th Annual
International Symposium on Computer Architecture, pp. 169–178, May 1993.
[81] Shimizu, T., Korematu, J., Satou, M., Kondo, H., Iwata, S., Sawai, K., Okumura, N.,
Ishimi, K., Nakamoto, Y., Kumanoya, M., Dosaka, K., Yamazaki, A., Ajioka, Y., Tsub-
ota, H., Nunomura, Y., Urabe, T., Hinata, J., and Saitoh, K., “A multimedia 32b
risc microprocessor with 16mb dram, ” In Proc. of the 1996 International Solid-State
Circuits Conference, pp. 216–217, Feb. 1996.
[82] SPEC (Standard Performance Evaluation Corporation), In http://www.specbench.org/.
[83] Srinivasan, S. T., and Lebeck, A. R., “Load latency tolerance in dynamically scheduled
processors, ” In Proc. of the 31st Annual International Symposium on Microarchitecture,
Nov.–Dec. 1998.
[84] Stan, M. R., and Burleson, W. P., “Bus-invert coding for low-power i/o, ” In IEEE
Transaction on Very Large Scale Integration Systems, volume 3, pp. 49–58, Mar. 1995.
[85] Su, C. L., and Despain, A. M., “Cache design trade-offs for power and performance
optimization: a case study, ” In Proc. of the 1995 International Symposium on Low
Power Design, pp. 69–74, Apr. 1995.
[86] Theobald, K. B., Hum, H. H. J., and Gao, G. R., “A design framework for hybrid-access
caches, ” In Proc. of the 1st International Symposium on High-Performance Computer
Architecture, pp. 144–153, Jan. 1995.
[87] Tomiyama, H., and Yasuura, H., “Code placement techniques for cache miss rate reduc-
tion, ” In ACM Transactions on Design Automation of Electronic Systems, volume 2,
pp. 410–429, Oct. 1997.
[88] Tomiyama, H., Ishihara, T., Inoue, A., and Yasuura, H., “Instruction scheduling for
power reduction in processor-based system design, ” In Proc. of Design Automation and
Test in Europe, pp. 855–860, Feb. 1998.
[89] Tremblay, M., and O’Connor, J. M., “Ultrasparc-i: A four-issue processor supporting
multimedia, ” In IEEE Micro, volume 16, pp. 42–50, Apr. 1996.
[90] Tyson, G., Farrens, M., Matthews, J., and Pleszkun, A. R., “A modified approach to
data cache management, ” In Proc.of the 28th Annual International Symposium on
Microarchitecture, pp. 93–103, Nov./Dec. 1995.
[91] Veidenbaum, A. V, Tang, W., Gupta, R., Nicolau, A., and Ji, X., “Adapting cache line
size to application behavior, ” In The International Conference on SuperComputing,
Nov. 1999.
[92] Vleet, P. V., Anderson, E., Brown, L., Baer, J., and Karlin, A, “Pursuing the perfor-
mance potential of dynamic cache line sizes, ” In Proc. of the International Conference
on Computer Design: VLSI in Computers & Processors, pp. 528–537, Oct. 1999.
[93] Walsh, S. J., and Board, J. A., “Pollution control caching, ” In Proc. of the International
Conference on Computer Design: VLSI in Computers & Processors, pp. 300–306, Oct.
1995.
[94] Wilson, K. M. and Olukotun, K., “Designing high bandwidth on-chip caches, ” In Proc.
of the 24th Annual International Symposium on Computer Architecture, pp. 121–132,
June 1997.
[95] Wilton, S. J. E. and Jouppi, N. P., “An enhanced access and cycle time model for
on-chip caches, ” In Digital WRL Research Report 93/5, July 1994.
[96] Wilton, S. J. E. and Jouppi, N. P., “Cacti: An enhanced cache access and cycle time
model, ” In IEEE Journal of Solid-State Circuits, volume 31, pp. 677–688, May 1996.
[97] Wulf, W. A. and McKee, S. A., “Hitting the memory wall: Implications of the obvious, ”
In ACM Computer Architecture News, volume 23, Mar. 1995.
[98] Yeager, K. C., “The mips r10000 superscalar microprocessor, ” In IEEE Micro, vol-
ume 16, pp. 28–40, Apr. 1996.
[99] Zhang, C., Zhang, X., and Yan, Y., “Two fast and high-associativity cache schemes, ”
In IEEE Micro, volume 17, pp. 40–49, Sep./Oct. 1997.
List of Publications by the Author
Journal Publications
[J-1] Inoue, K., Kai, K., and Murakami, K., “High Bandwidth, Variable Line-Size Cache
Architecture for Merged DRAM/Logic LSIs,” IEICE Transactions on Electronics, vol.
E81-C, no.9, pp.1438–1447, Sep. 1998.
[J-2] Inoue, K., Ishihara, T., and Murakami, K., “A High-Performance and Low-Power
Cache Architecture with Speculative Way-Selection,” IEICE Transactions on Elec-
tronics, vol. E83-C, no.2, Feb. 2000.
[J-3] Inoue, K., Kai, K., and Murakami, K., “Dynamically Variable Line-Size Cache Ar-
chitecture for Merged DRAM/Logic LSIs,” IEICE Transactions on Information and
Systems, vol. E83-D, no.5, pp.1048–1057, May 2000.
[J-4] Inoue, K., Kai, K., and Murakami, K., “A High-Performance / Low-Power On-chip
Memory-Path Architecture with Variable Cache-Line Size,” IEICE Transactions on
Electronics, vol. E83-C, no. 11, pp.1716–1723, Nov. 2000.
[J-5] Inoue, K., Ishihara, T., Kai, K., and Murakami, K., “High-Performance/Low-Power
Cache Architectures for Merged DRAM/Logic LSIs (in Japanese),” To appear in IPSJ
Journal, vol. 42, no. 3, Mar. 2001.
International Conference Publications
[C-1] Nakagaki, K., Ouchi, M., Inoue, K., Apduhan, B. O., Kuga, M., Sueyoshi, T., “Design
and Implementation of the Educational Microprocessor DLX–FPGA Using VHDL,”
Proceedings of the Second Asian Pacific Conference on Hardware Description Lan-
guages, pp.147-150, Oct. 1994.
[C-2] Miyajima, H., Inoue, K., and Murakami, K., “On-Chip Memorypath Architecture
for Parallel Processing RAM (PPRAM),” Workshop on Mixing Logic and DRAM
(http://iram.CS.Berkeley.EDU/isca97-workshop/), June 1997.
[C-3] Murakami, K., Inoue, K., and Miyajima, H., “PPRAM (Parallel Processing RAM): A
Merged-DRAM/Logic System-LSI Architecture,” Proc. of The International Confer-
ence on Solid State Devices and Materials, pp.274–275, Sep. 1997.
[C-4] Inoue, K., Kai, K., and Murakami, K., “Dynamically Variable Line-Size Cache Ex-
ploiting High On-Chip Memory Bandwidth of Merged DRAM/Logic LSIs,” Proc.
of The Fifth International Symposium on High-Performance Computer Architecture
(HPCA-5), pp.218–222, Jan. 1999.
[C-5] Inoue, K., Ishihara, T., and Murakami, K., “Way-Predicting Set-Associative Cache
for High Performance and Low Energy Consumption,” Proc. of 1999 International
Symposium on Low Power Electronics and Design (ISLPED’99), pp.273–275, Aug.
1999.
[C-6] Hashimoto, K., Tomita, H., Inoue, K., Metsugi, K., Murakami, K., Miyakawa, N.,
Inabata, S., Yamada, S., Takashima, H., Kitamura, K., Obara, S., Amisaki, T., Tan-
abe, K., Nagashima, U., and Hayakawa, K., “MOE: A Special-Purpose Parallel Com-
puter for High-Speed, Large Scale Molecular Orbital Calculation,” SuperComputing
(SC99), Nov. 1999.
[C-7] Inoue, K., Kai, K., and Murakami, K., “An On-chip Memory-Path Architecture on
Merged DRAM/Logic LSIs for High-Performance/Low-Energy Consumption,” Proc.
of International Symposium on Low-Power and High-Speed Chips (COOL Chips III),
pp.283, Apr. 2000.
[C-8] Inoue, K., Kai, K., and Murakami, K., “Performance/Energy Efficiency of Variable
Line-Size Caches on Intelligent Memory Systems,” Proc. of The 2nd Workshop on
Intelligent Memory Systems, Nov. 2000.
[C-9] Inoue, K., and Murakami, K., “A Low-Power Instruction Cache Architecture Ex-
ploiting Program Execution Footprints,” To appear in Work-in-progress Session at
(not included in the proceedings of) The Seventh International Symposium on High-
Performance Computer Architecture (HPCA-7), Jan. 2001.
Technical Society Meeting and Domestic Conference
Publications
[T-1] Nakagaki, K., Inoue, K., Kuga, M., and Sueyoshi, T.,“Design and Implementation of
the Educational Microprocessor DLX–FPGA for Advanced Computer Architecture
Course,” IEICE Technical Report, CPSY94–57, Sep. 1994.
[T-2] Inoue, K., Nakagaki, K., Ouchi, M., Kuriyama, T., Kuga, M., and Sueyoshi, T.,
“Implementation of the Floating-Point Pipeline for the DLX–FPGA Microprocessor
(in Japanese),” IPSJ SIG Notes, ARC-110-19, DA-73-19, Jan. 1995.
[T-3] Inoue, K., Nakagaki, K., Ouchi, M., Kuga, M., and Sueyoshi, T.,“Design and
Rapid System Prototyping of the Educational RISC Microprocessor DLX-FPGA (in
Japanese),” IEICE Technical Report, CPSY95-20, FTS95-20, ICD95-20, Apr. 1995.
[T-4] Sueyoshi, T., Inoue, K., Okumura, M., and Kuga, M., “Development of an FPGA
board for the 32-bit Educational RISC Microprocessor DLX–FPGA (in Japanese),”
Proc. of The Third Japanese FPGA/PLD Design Conference & Exhibit, pp.579–588,
July 1995.
[T-5] Inoue, K., Okumura, M., Kuga, M., and Sueyoshi, T., “Rapid Prototyping of the Edu-
cational 32-bit RISC Microprocessor DLX-FPGA,” Proc. of IPSJ General Conference,
vol. 6, 6P-2, Sep. 1995.
[T-6] Inoue, K., Iida, M., Ouchi, M., Kuga, M., and Sueyoshi, T., “A Feasibility Study for
Design Education Using 32bit RISC Microprocessor DLX-FPGA,” IPSJ SIG Notes,
ARC-115-18, DA-78-18, pp. 109-114, Dec. 1995.
[T-7] Inoue, K., Miyajima, H., Kai, K., and Murakami, K.,“An examination of On-chip
Memorypath Architecture for PPRAM-type LSI (in Japanese),” IEICE Technical Re-
port, ICD97-10, CPSY97-10, FTS97-10, pp. 25-32, Apr. 1997.
[T-8] Murakami, K., Inoue, K., and Miyajima, H., “PPRAM: A Merged Memory/Logic
System LSI Architecture (in Japanese),” Society Symposium Plan: New Trend of
VLSI Architecture, 55th IPSJ General Conference, Sep. 1997.
[T-9] Inoue, K., Kai, K., and Murakami, K.,“Dynamically Variable Line-Size Caches
Exploiting High On-Chip Memory Bandwidth of Merged DRAM/Logic LSIs (in
Japanese),” IEICE Technical Report, ICD98-25, CPSY98-25, FTS98-25, pp. 109-116,
Apr. 1998.
[T-10] Inoue, K., Ishihara, T., and Murakami, K.,“A High-Performance/Low-Energy Cache
Architecture with Way-Prediction Technique (in Japanese),” IEICE Technical Report,
VLD98-44, ICD98-147, FTS98-71, pp. 1-8, Sep. 1998.
[T-11] Inoue, K., Ishihara, T., and Murakami, K.,“A High-Performance Set-Associative
Cache Architecture with Speculative Way-Selection (in Japanese),” IEICE Techni-
cal Report, DSP98-94, ICD98-181, CPSY98-96, pp. 35-42, Oct. 1998.
[T-12] Inoue, K., Kai, K., and Murakami, K.,“Performance and Energy Evaluation of a Dy-
namically Variable Line-Size Cache (in Japanese),” IEICE Technical Report, ICD2000-
5, pp. 25-30, Apr. 2000.
[T-13] Inoue, K., and Murakami, K.,“Tag Comparison Omitting for Low-Power Instruction
Caches (in Japanese),” IPSJ SIG Notes, ARC140-6, pp. 25–30, Nov. 2000.