CS-2002-05
Building DRAM-based High Performance Intermediate
Memory System
Junyi Xie and Gershon Kedem
Department of Computer Science
Duke University
Durham, North Carolina 27708-0129
May 15, 2002
Abstract
Although the speed of modern microprocessors improves at a rate of around 80% per year, memory
speed improves by only about 5% per year. According to Amdahl's law, overall computer performance
is determined not only by the processor but also by the speed of the memory system. Previous
research shows that access to the memory system typically accounts for more than 50% of a program's
execution time. Given the widening performance gap between processors and the memory system, it is
essential to build high-performance memory systems to alleviate the disparity. In this research
project, we build a DRAM-based, cost-effective, large and fast cache system that reduces the high
cost incurred by power-hungry SRAM and improves cache and overall system performance. In addition,
based on this design, we integrate into this cache model two prefetch schemes that predict and
aggressively prefetch the next referenced cache line from the next level of the memory hierarchy to
further improve cache performance. Compared with a traditional prefetching cache, our prefetching
schemes introduce much less hardware cost, since we only need to maintain prediction information
for the small SRAM buffer. Our simulation shows that our 1 MB DRAM-based on-chip L2 cache with
only a 64 KB fast SRAM buffer can outperform a typical 256 KB on-chip SRAM-based L2 cache by 118%
on average, on 10 benchmarks from SPEC95 and SPEC2000 with miss rates over 5% on the baseline cache.
1 Introduction
The memory hierarchy of current computing systems has been under pressure placed by continued
improvements in processor performance, in particular the sharp increase in clock frequencies.
The speed gap between the processor and the memory hierarchy is continuously widening as
general-purpose processors get faster. Roughly speaking, processor speed increases
by a factor of two every 18 months, while DRAM speed improves at only about 5% per year.
As a result, computer systems have passed the point at which overall performance is
determined by processor speed alone; it now depends heavily on the performance of the memory
system. Current system designers employ a wide range of techniques to reduce or tolerate memory
system delays, including dynamic scheduling, speculative execution, multilevel caches,
non-blocking accesses and prefetching in the cache hierarchy. One intensively explored direction
among these techniques is building a high-performance cache hierarchy within current memory
technologies. Currently, almost all microprocessors use on-chip level-1 or level-2 caches to
provide fast access to data and achieve high processor performance. These on-chip caches are
usually built with fast and expensive static memory (SRAM) to keep pace with the processor.
However, due to chip area constraints and other restrictions such as energy consumption, on-chip
caches cannot be built large enough to provide a high hit rate. Consequently, because of the
large miss penalty resulting from the wide speed gap between processor and memory, system
performance is slowed down greatly. In this project, we explore the possibility of building a
high-performance and cost-effective external cache based on dense but relatively slow DRAM memory.
In our design, we combine large DRAM arrays with small, fast SRAM buffers on a single
integrated circuit to form a cache subsystem. The cache design proposed in this paper can serve
as either an external or an on-chip L2 cache. Typically, the size ratio of the DRAM arrays to
the SRAM buffer is over 16. Since the DRAM arrays account for most of the size of the cache
system, the cost per bit can be kept low accordingly. Moreover, we also design it as a
DRAM-page based prefetching cache using different prediction schemes. We organize this report as
follows. In section 2 we review previous related research in this field. In section 3 we
describe the design of the DRAM-based cache system in detail. We introduce page-based
prediction and prefetching for the DRAM-based cache in section 4. In section 5, we present the
simulation environment and experimental results comparing the performance of this cache with
the baseline cache architecture. We conclude with a summary and future work in section 6.
2 Related Work
To reduce the overhead of retrieving data from the memory system for programs with large working
sets, different techniques, including fast tag matching for wide set-associative caches [15] and
cached DRAM, have been in use for some time. In 1995, Alexander & Kedem [2] [1] introduced a new
main memory architecture in which the L2 cache is integrated into the DRAM array to form a
distributed cache. In 1997, Wong [18] quantified the performance improvement obtained by adding
some SRAM cache on the DRAM chip, over various design alternatives for the line size and
associativity of the SRAM cache. Newer DRAM technologies such as CDRAM and EDRAM use a small
SRAM page to replace the page buffer and use on-chip DRAM caching to eliminate the drawbacks
of page-mode DRAMs [4] [8] [14]. Koganti & Kedem [13] [12] introduced the WCDRAM architecture,
which takes advantage of very wide cache lines integrated into DRAM in order to reduce the
average DRAM access time. Because data prefetching can hide memory latency by bringing data
into a higher level of memory before it is required, it is widely used, and different
prefetching and prediction schemes have been proposed. Smith [16] proposed one-block-lookahead
prefetching and evaluated this scheme extensively. In 1990, Jouppi [11] first introduced the
concept of the stream buffer, which prefetches consecutive blocks, starting from the missed
cache block, into a small buffer associated with the cache. Stride prefetching was first
proposed by Chen & Baer [3] [5] [6]; it uses the stride of past references to predict the next
referenced cache block. Although history-based prediction schemes do not help the DRAM-based
cache in this project, they are quite effective in conventional cache systems. First-order
Markov prediction was proposed by J. Pomerene [7]. Alexander & Kedem [1] [2] used distributed
table-based prediction to prefetch cache blocks, and in 1997, Joseph & Grunwald [10] extensively
evaluated cache block prefetching based on first-order Markov predictors. Kedem & Yu [19]
explored DRAM-page based prefetching from memory to the L2 cache using different prediction
schemes. Lin & Reinhardt [17] evaluated an aggressive prefetch unit integrated into the L2 cache
and memory controller; their scheme issues prefetches only when the Rambus channels are idle.
Okuda et al. [9] described circuit technologies developed for a high-speed, large-bandwidth
on-chip 12 ns 8 MB DRAM secondary cache.
3 DRAM-based External Cache
3.1 Overview
Figure 1 is a block diagram of the DRAM-SRAM cache, and Figure 2 is a block diagram of the data
module in Figure 1. We use cached-DRAM ICs to implement this cache design. By using a DRAM
array associated with a fast SRAM buffer, our design can implement a very large external
cache. The cache subsystem consists of three components: the cache controller, the cache-data
memory and the cache-tag memory. The cache-data memory is where the cached data is stored. It
can be divided into multiple independent banks. For illustration purposes, we describe here an
instance with four banks, each of which can operate independently, so the data stored in the
DRAM array is four-way interleaved. Within each bank, the DRAM array does not interact with the
outside directly. Instead, data is transferred between the DRAM array and the outside memory
bus via a small, fast SRAM buffer. The SRAM buffer generally holds a set of active DRAM cache
pages; logically, it is the SRAM buffer that is connected to the outside, and the sense
amplifiers work as an interface between the DRAM array and the associated SRAM buffer. Because
we integrate the DRAM array and the SRAM buffer into the same IC, we can easily achieve very
high internal bus bandwidth between them. In particular, it takes one cycle to transfer one
complete DRAM page between the sense amplifiers and the associated SRAM buffer. In addition,
the DRAM array is implemented with technology optimized for speed instead of density. Specifically,
if we assume that the row access time of conventional DRAM chips is around 30 ns, the DRAM
arrays used in this cache subsystem take only 12-18 ns to perform one row access operation and
make the data ready in the sense amplifiers.

[Figure: cache controller with primary and secondary tag arrays, a processor interface, and a
data module of four DRAM banks, each paired with its own SRAM buffer and DRAM/SRAM address and
control lines]
Figure 1: DRAM-based Cache Block Diagram

The cache-tag memory is composed of two tag arrays: a primary tag array and a secondary tag
array. As in any conventional cache architecture, the tag arrays use fast SRAM to hold the tag
information and speed up cache access. Since most of the cached data resides in the DRAM array,
we can build a very large yet cost-effective cache subsystem without being limited by the cost
and size constraints imposed by an SRAM-based cache. In this project we simulate an internally
four-way set-associative cache; alternatives such as direct-mapped or other set-associative
organizations can also be implemented.
[Figure: each of the four data-module banks contains a DRAM array, sense amplifiers with
latches, and an SRAM array, fed through a shared write queue]
Figure 2: Data Module of DRAM-based Cache
3.2 Cache Controller and Tag Arrays
As the block diagram in Figure 1 shows, there are three major components in the cache
subsystem: a cache controller, tag memory and cache-data memory. Just like its counterpart in a
conventional cache system, the cache controller mediates between the processor and the cache
memory to coordinate data traffic between them. When a reference arrives at the cache, the
controller first looks up the tag arrays to see whether the required data is already stored in
the cache. On a cache hit, the cache controller updates the tag memory and transfers the
required data back to the processor. If not, it issues a memory request to fetch the
corresponding cache line from the next level of the memory hierarchy. The controller also
manages the internal data transfers between the DRAM arrays and the associated SRAM buffers. If
prefetching hardware is integrated into the DRAM-based cache, the cache controller predicts
candidates for the next referenced cache line on each cache access and issues the corresponding
memory requests to fetch the prefetched cache lines.
Beyond the DRAM banks and associated SRAM buffers, the cache system works like any other cache
system, using tag memory to manage the cached data. There are two tag arrays connected to the
cache controller, a primary tag array and a secondary tag array, both implemented in fast SRAM.
Just like the tag array in a conventional cache, the primary tag array is used to access the
data in the cache by matching the tag portion of the incoming request's address against that of
the stored data. However, there are two differences between the tag array of the DRAM-based
external cache and that of a conventional cache. First, in a conventional cache, each cache
line, or each set of cache lines, is identified by the tag portion of its address: the tag
array maintains one tag entry for each set of cache blocks in a set-associative cache, or for
each cache block in a direct-mapped cache. The index portion of the incoming reference address
selects the target cache block, and the tag portion is then sufficient to decide whether that
block is in the cache. In the DRAM-based cache, by contrast, the tag array maintains one tag
entry for each DRAM page rather than for each cache block in the DRAM arrays. Consequently, the
tag portion of the incoming reference address can only indicate the residence of the DRAM page
in which the target cache block may reside. The residence of a DRAM page does not necessarily
mean that every cache line in this page also resides in the cache. This is because the
DRAM-based cache stores cached data by treating one DRAM page as one cache block. Therefore,
when a missing cache line is fetched from the next level of memory, and the corresponding DRAM
page in which this cache line resides is not yet allocated in
cache, the cache controller allocates a whole DRAM page for this cache line. At this point, the
allocated DRAM page in the cache has only one valid cache line; all other cache lines are
invalid or empty. We call the invalid or empty cache lines in a cached DRAM page holes.
Obviously, the tag information of the incoming reference address alone is not enough to decide
whether the target cache line is already in the cache, since there may be holes in the
corresponding cached DRAM page. To address this problem, we introduce an array of bits for each
DRAM page in the cache indicating whether each specific cache line is present. For example, if
the DRAM page size is 4K bytes and the cache line size is 128 bytes, there are 32 cache lines
in one DRAM page, so we maintain an array of 32 bits for each DRAM page to indicate the
presence of the 32 cache lines. When a cache line is fetched from the next level of memory, the
corresponding presence bit is set to 1, meaning this cache line is now in the cache. The
presence array is also implemented in tag memory. The tag array together with the presence
array is enough for the cache controller to decide whether the cache line of the incoming
reference address is already cached in the DRAM-based cache. Figure 3 shows the structure of a
tag entry in the primary tag memory, in which TAG is the tag portion of the cached DRAM page,
bits V and B indicate whether this page is valid and buffered in SRAM, PA is the presence
array, and I is the index of the DRAM page in the SRAM buffer if it is buffered. Figure 4 shows
an example entry in the primary tag array. In this example, the tag of a valid, buffered cache
page is 0x0724 and its index is 0x4E. In this page only the first two cache lines are present,
so the presence bit array is 0x3. This cache page is buffered in the 11th (0xB) row of the SRAM
buffer. The second difference between the DRAM-based tag memory and that of a conventional
cache is that there are two sub
tag memories, a primary and a secondary tag array, in the DRAM-based cache, while there is only
one tag array in a conventional cache. We maintain two tag arrays because, besides the primary
tag array used to index the target cache line, we need a secondary tag array as a backward
pointer to index the cached DRAM page that is buffered in fast SRAM. We employ a write-back
policy between the SRAM buffer and the DRAM array; thus, when a page in SRAM must be evicted,
we need to write it back to DRAM if it is dirty. The secondary tag array is used to find where
to write back a dirty page from SRAM. Figure 5 shows an entry of the secondary tag array, in
which Index is used to index the tag entry in the primary tag memory, TAG is the tag portion of
the buffered page in SRAM, and bit D indicates whether this page is dirty. Figure 6 shows the
entry for a dirty page in SRAM that buffers the cache page of Figure 4. We describe the
indexing algorithms and replacement policy of this cache in detail in the section on Cache
Operations.
V | Tag | Index | Present Bits | Buffered Address | B
Figure 3: Primary Tag Entry

1 | 0x0724 | 0x4E | 0x3 | 0xB | 1
Figure 4: Sample Primary Tag Entry

Tag | Index | D
Figure 5: Second Tag Entry

0x0724 | 0x4E | 1
Figure 6: Sample Second Tag Entry
3.3 Cache-data Memory
The DRAM array is used to store the cached data. It accounts for most of the IC area compared
with the SRAM buffer, which greatly reduces the cost per bit. The DRAM arrays can be organized
as one or several independent banks, each associated with its own SRAM buffer and operating
independently. Figure 7 shows a single-bank DRAM-based cache in which only the cache-data
memory is illustrated; the cache controller and tag memories are omitted. Each bank of the DRAM
array is associated with a buffer made of fast, expensive static memory (SRAM). Logically, the
SRAM buffer is organized as a set of rows, each as wide as a row of the DRAM array, so each
SRAM row can hold one DRAM page. The SRAM buffer acts as a fully associative cache with an LRU
replacement policy. To understand the cache behavior, we can view the DRAM array with its SRAM
buffer as a data cache in which the SRAM buffer serves as an
internal cache to the DRAM arrays. Moreover, the wide, high-bandwidth on-chip data bus between
the SRAM buffer and the DRAM array makes it possible to transfer one DRAM page between the SRAM
buffer and the sense amplifiers in only one clock cycle.

[Figure: a single DRAM bank with sense amplifiers, column select, an SRAM buffer, and DRAM/SRAM
address lines]
Figure 7: DRAM-based Cache Block Diagram

The data stored in one DRAM page (one SRAM row)
maps one segment of consecutive main memory, but this does not mean that the whole segment, the
size of one DRAM page, must be transferred together between main memory and the cache system.
From the perspective of either the upper or the next level of memory, this cache subsystem
works like any conventional cache: data transfer between this cache and the upper or next level
of the memory hierarchy is still cache-line based, with each external transfer moving one cache
line.
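The SRAM buffer's role as a small, fully associative, LRU-managed cache of DRAM pages, including the write-back of dirty victim pages to the DRAM array, can be modeled with a short sketch; the row count and page ids below are illustrative, not the simulated configuration.

```python
from collections import OrderedDict

# Toy model of the SRAM buffer: fully associative, LRU replacement,
# write-back of dirty pages into the DRAM array on eviction.
class SramBuffer:
    def __init__(self, rows=16):
        self.rows = rows
        self.pages = OrderedDict()   # page_id -> dirty flag, kept in LRU order

    def touch(self, page_id, write=False):
        """Access a page; return the evicted dirty page to write back, if any."""
        victim = None
        if page_id in self.pages:
            self.pages.move_to_end(page_id)          # refresh LRU position
        else:
            if len(self.pages) >= self.rows:
                vid, vdirty = self.pages.popitem(last=False)  # evict LRU page
                if vdirty:
                    victim = vid                      # must go back to DRAM
            self.pages[page_id] = False              # bring page in, clean
        if write:
            self.pages[page_id] = True               # mark dirty
        return victim
```

For example, with two rows, touching page 1 (a write), then 2, then 3 evicts page 1 and reports it for write-back into the DRAM array, mirroring the internal write-back policy described later.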
3.4 Cache Operations
When the processor initiates a data request to the on-chip L1 cache and an L1 miss occurs, the
request is passed from the L1 cache to this external cache system. The cache controller
extracts the tag, the index portion and the cache-line offset within the page from the incoming
address and looks up the primary tag memory to see whether the target cache line is already in
the cache. On a cache hit, the incoming address is parsed and the row access address is sent to
the DRAM array address bus to activate one DRAM page into the sense amplifiers. At the same
time, if the DRAM page is already in the SRAM buffer, the buffered address is read from the
primary tag and sent to the address bus of the SRAM buffer, and the cache line is served to the
request. We call this kind of hit a fast hit, since the data can be served directly from the
fast SRAM buffer. Otherwise, if the DRAM page is not buffered yet, we need to transfer this
page from the DRAM array to the SRAM buffer first and then serve the data to the processor; we
call this a delayed hit, since it takes longer to serve the data due to the internal page
transfer. The last possible case is a cache miss, meaning the data is in neither the DRAM array
nor the SRAM buffer; in this case the cache system initiates an access request to the next
level of the memory hierarchy, as any conventional cache does. The next two subsections give
detailed descriptions of the cache behavior when an external memory request arrives.
3.4.1 Read Memory Request
Tables 1 and 2 summarize the cache behavior and corresponding actions when the cache system
receives a data read request. First, the controller does a lookup in the primary tag array.
Depending on the status of the tag match, valid bit, presence bit and buffered bit, there are
at most six possible cases of cache access, as illustrated in Tables 1 and 2:
1. The corresponding page in DRAM is invalid. This is obviously a cache miss, since the whole
cache page does not reside in the DRAM array. In this case, the cache controller opens an
unused cache page in DRAM and SRAM, evicting an old page if necessary, and fetches the
cache line from the next level of memory.
2. The tag does not match, although the related cache page is valid in DRAM. As in case one,
the cache controller needs to allocate a cache page in DRAM and SRAM and fetch the
corresponding cache line from the next level of memory.
3. The tag matches and the page is valid, but the presence and buffered bits are both zero.
This means that although the enclosing cache page is already in DRAM, the target cache line is
missing. In this case, the cache controller fetches the line and allocates a row in SRAM for this page.
4. This case is the same as case three except that the buffered bit is set. The cache controller
does the same thing as in case three; the only difference is that it does not need to allocate
an SRAM row, since the page is already buffered.
5. The target cache line is already in the DRAM array: a cache hit. If the page is not yet
buffered (the buffered bit is zero), it is first copied from DRAM to SRAM and then the cache
can serve the data. This is a delayed hit, since the data is delayed by the miss in SRAM.
6. The last case is the fast hit. Here the target cache line is not only already in the DRAM
array but also buffered in SRAM, so the data can be served directly at the latency of the fast
SRAM buffer.
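The six cases above can be sketched as a small decision function; this is an illustrative model with hypothetical action strings, not the controller's actual logic.

```python
# Illustrative read-access classification: the case is determined by the
# valid bit, tag match, presence bit of the target line, and buffered bit
# of the enclosing page.
def classify_read(valid, tag_match, present, buffered):
    """Return (status, action) for a read request; a sketch, not RTL."""
    if not valid:
        return "miss", "open page in DRAM and SRAM, fetch line"
    if not tag_match:
        return "miss", "replace DRAM page, open SRAM page, fetch line"
    if not present:
        # page resident but the target line is a hole
        if buffered:
            return "miss", "fetch line into buffered page"
        return "miss", "copy page into SRAM, fetch line"
    if not buffered:
        return "delayed hit", "copy page from DRAM into SRAM, serve data"
    return "fast hit", "serve data from SRAM buffer"

assert classify_read(False, False, False, False)[0] == "miss"
assert classify_read(True, True, True, False)[0] == "delayed hit"
assert classify_read(True, True, True, True)[0] == "fast hit"
```

Note that a hit requires both a tag match and a set presence bit; the buffered bit only decides whether the hit is fast or delayed.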
3.4.2 Write Memory Request
When a write access request arrives, the cache behavior is almost the same as for a read
access. When a write miss occurs, the request is passed to the next level of memory, a cache
page is allocated in both DRAM and SRAM to hold the fetched cache line, and the dirty bit is
set for the page; in this sense, it is a write-allocate cache. From the perspective of the
processor, this cache system applies a write-through policy, so when no unused DRAM page is
available, we simply discard a DRAM page under a chosen replacement policy, e.g., LRU, and
nothing is written back to the next level of memory. Write-through is not the only choice: we
could apply a write-back policy instead, writing back either the whole DRAM page or just the
dirty cache lines in it. In either case, however, we would need to maintain a write-back buffer
the size of one cache page, because we cannot predict how many cache lines in the page are
dirty and need to be written back. This obviously incurs extra hardware cost, and the
write-back buffer would be much larger than that of a conventional cache, since it must hold a
whole cache page rather than a single cache line. In this project we choose write-through for
its hardware simplicity. On the other hand, from the viewpoint within the cache, the cache
subsystem applies
a write-back policy between the SRAM buffer and the DRAM array. That is, when a dirty SRAM row
is to be evicted under the replacement policy, the cache controller first copies the victim
page from the SRAM buffer into the DRAM array and then allocates the victim row to a new DRAM
page.

Action Valid Tag Match Present Buffered Status
1 0 - - - Miss
2 1 N - - Miss
3 1 Y N N Miss
4 1 Y N Y Miss
5 1 Y Y N Delayed Hit
6 1 Y Y Y Hit
Table 1: Cache Behavior On Data Read

Action Description
1 open cache page in both DRAM and SRAM, evict if necessary, fetch cache line
2 replace cache page in DRAM, open a page in SRAM, evict if necessary, fetch cache line
3 evict SRAM page, copy page from DRAM into SRAM, fetch cache line
4 fetch cache line into the already buffered page
5 evict SRAM page, copy page from DRAM into SRAM, serve data
6 serve data directly
Table 2: Cache Action Description On Data Read

The
write-back policy makes it possible for the data in the DRAM array to be inconsistent with that
in the SRAM buffer, so the data in DRAM may not be the most up to date. But this has no effect
on incoming data references, since on a cache hit all requests are served from the SRAM buffer.
We apply the write-back policy to internal data transfers in order to reduce traffic on the
wide internal data bus between SRAM and DRAM. In this internal write-back process, the
secondary tag array serves as a backward pointer from the SRAM buffer to the DRAM array.
4 Data Prefetching
4.1 Overview
Data prefetching predicts what data will be referenced in the near future, based on some
heuristic, and fetches it into the cache before the processor asks for it. Data prefetching can
effectively hide a high cache miss penalty, since it makes the data ready for the processor
before the miss occurs, and a variety of prefetching techniques are now widely employed in
different cache designs. In the DRAM-based cache subsystem, there are intuitively two
opportunities to exploit for data prefetching. One is to prefetch cache lines between the cache
subsystem and the next level of memory, to increase the overall hit rate of the cache
subsystem. The other lies inside the cache, between the small, fast SRAM buffer and the DRAM
arrays. Inner prefetching cannot improve the overall hit rate, since it does not prefetch
anything from outside, but it can help reduce misses in the SRAM buffer and make more cache
hits land in the SRAM buffer, thus reducing the average cache access time. In this project we
exploit page-based prediction and prefetching between the cache and the next level of memory.
We do not pursue inner prefetching, for two reasons. First, our simulations show that a very
small SRAM buffer, e.g., 64 KB or 128 KB, catches most of the hits to this cache subsystem. In
particular, our simulation of a 1 MB DRAM array shows that a 64 KB SRAM buffer catches more
than 96% of the cache hits on average; that is, more than 96% of the hits to the cache system
fall into the SRAM buffer and less than 4% are delayed hits. Prefetching additional pages from
DRAM to SRAM is therefore unlikely to improve performance: there is little room left to reduce
the delayed-hit rate, and, due to the small size of the SRAM buffer, the resulting cache
pollution could cancel out any benefit from prefetching. The second reason is that prefetching
from DRAM to SRAM incurs much more hardware cost than prefetching between the cache subsystem
and the next level of memory, because we would need a prediction table for each cache page to
decide which cache line to prefetch. The SRAM buffer holds a very limited number of pages,
e.g., 16 cache pages, so we only need to maintain 16 prediction tables; the DRAM array,
however, is usually bigger than the SRAM buffer by a factor of 16 or more. Therefore, inner
prefetching makes little sense as long as prefetching between the cache subsystem and the next
level of memory keeps the miss rate in DRAM low enough.
4.2 DRAM-page Based Prefetching
In DRAM-page based prefetching, when a cache line is demand-fetched from main memory into the
cache, the cache system predicts another cache line residing in the same DRAM page as the
demand-fetched line and prefetches it together with the demand-fetched line. A good prediction
heuristic predicts the cache line most likely to be referenced in the near future. In current
DRAM technology, a DRAM access cycle can be divided into three parts, in order: row access,
column access and the precharge period. A row access operation reads a whole page of data into
the sense amplifiers, and consecutive pipelined column access commands are then issued to read
out specific cache lines. Once an access is complete, the memory controller must precharge the
DRAM bank and associated sense amplifiers, and during the precharge period the data in the
sense amplifiers is lost. The key fact making DRAM-based prefetching effective is that once a
DRAM page is open, the incremental cost of fetching an additional cache line from that page is
much smaller than that of fetching a cache line from another DRAM page. In fact, reading
another cache line from the same open page takes only about 20% of the time needed to read a
cache line from a different DRAM page.
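As a back-of-the-envelope check, the cycle counts of Table 3 (treated here as illustrative, since the exact ratio depends on which latencies are counted) show why the open-page read is so much cheaper: it skips the row access and the precharge.

```python
# Rough open-page vs closed-page read cost, using cycle counts in the
# spirit of Table 3 (illustrative assumptions, not exact hardware timing).
ROW_ACCESS = 50   # row address accepted -> data available in sense amps
COL_ACCESS = 10   # one additional pipelined column access (one line)
PRECHARGE  = 30   # close the page before opening another

closed_page_read = ROW_ACCESS + COL_ACCESS + PRECHARGE  # 90 cycles
open_page_read   = COL_ACCESS                           # 10 cycles
ratio = open_page_read / closed_page_read               # about 0.11
print(ratio)
```

With these particular numbers the extra line costs roughly a tenth of a full closed-page access, the same order of magnitude as the ~20% figure cited above.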
4.3 Prediction Heuristics
When a prefetch is triggered by a trigger address, different prediction schemes can be used to
generate prefetch candidates. Commonly used schemes include history-based prediction such as
first-order Markov prediction, stride prediction, and one-block-lookahead (OBL) prediction. For
the DRAM-based cache subsystem, history-based prediction schemes are excluded from the start.
This is because all cache lines within one DRAM page are evicted together when the page is
evicted. Therefore, the history reference pattern is useless for prediction: all cache lines in
the history reference pattern are already in the cache, so the pattern cannot generate any
prefetch candidate. For example, suppose that for a certain DRAM page in SRAM (we only do
prefetching for pages in SRAM) the past reference pattern is a, b, c, a, where a, b, c are
different cache lines in the page, and that the current referenced cache line is a. According
to the pattern we should prefetch b, since b should be the cache line most likely to be
referenced in the near future. However, since all referenced cache lines lie in the same DRAM
page in SRAM and are always evicted together when the page is evicted, we can be sure that
cache line b is already in the page when the prediction is made, because it appears in the
history reference pattern. Thus any prediction based on the history reference pattern loses its
meaning: all predicted cache lines are already in the cache and require no prefetching.
Consequently, we use stride prediction and OBL-based aggressive hole-searching prediction as
the prediction schemes in this DRAM-based cache subsystem.
These two schemes are very simple and easy to implement in hardware without incurring much
overhead. The first is stride prefetching. In stride prefetching, we use two shift registers to
keep the last two referenced cache lines. When a trigger address is encountered, we compare the
difference (stride) between the last two referenced cache lines in the shift registers with the
difference between the current referenced cache line and the last referenced cache line. If
they are equal, the three references so far exhibit a fixed stride, so it is reasonable to
predict that the next referenced cache line is the current one plus this stride. For example,
if the shift registers hold the entries a and a + s and the current referenced cache line is
a + 2s, then the stride is s and the next cache line to be referenced is probably a + 3s, so we
prefetch that cache line if it is not yet in the cache subsystem. The second scheme is
hole-searching prefetching. In this scheme,
when a trigger address is encountered, the presence array of the page in tag memory is
searched, starting from the current referenced cache line, for a hole, an absent cache line, in
the cache page; if one is found, this absent cache line is used as the prediction candidate. If
the search reaches the end of the page, it wraps around and continues from the beginning until
the whole presence array has been searched. If there is no hole in the page, i.e., all cache
lines of the page are already in the cache, no prefetch is issued, since no prediction is
generated. Our simulations show that these simple schemes work quite well for all cache
configurations.
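Both heuristics can be sketched in a few lines; the function names and the 32-lines-per-page default are illustrative assumptions, not the hardware interface.

```python
# Sketches of the two prediction schemes (hypothetical helper names).
def stride_predict(prev2, prev1, current, lines_per_page=32):
    """Predict the next cache line index from a fixed stride, else None."""
    if prev1 - prev2 == current - prev1:          # same stride twice in a row
        candidate = current + (current - prev1)
        if 0 <= candidate < lines_per_page:       # stay inside the page
            return candidate
    return None

def hole_predict(presence, current, lines_per_page=32):
    """Scan the presence array for the next absent line (a 'hole'),
    starting after the current line and wrapping around the page."""
    for i in range(1, lines_per_page):
        line = (current + i) % lines_per_page
        if not presence & (1 << line):
            return line
    return None                                   # page fully present

assert stride_predict(2, 4, 6) == 8               # stride 2 -> prefetch line 8
assert stride_predict(2, 4, 7) is None            # no fixed stride
assert hole_predict(0b0111, 0) == 3               # lines 0-2 present, 3 is a hole
```

Hole searching needs no state beyond the presence array already kept in the primary tag memory, while the stride scheme needs only the two per-row registers described below.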
4.4 Hardware Cost
Since we make predictions and prefetch only for the cache pages in the SRAM buffer, and the
SRAM buffer is generally very small, the extra hardware overhead for prefetching can be kept
much lower than in a traditional external cache. In this DRAM-based cache, each row in the SRAM
buffer is associated with two shift registers to implement stride prediction. The length of a
shift register is the number of cache lines contained in one cache page. For example, if the
size of a cache page is P bytes (4K bytes is typical in current DRAM technology) and the cache
line size is C bytes, the length of the shift register L is

L = P / C    (1)

The shift register is used to implement the stride prediction scheme. In the hole-searching
prediction scheme, we need no extra storage, since we can determine whether a hole exists in a
cache page from the presence
21
array in primary tag memory. Since history based prediction schemes can no longer be applied, we
do not need to allocate extra memory to record the past reference pattern for each cache line. In
the contrast, conventional prefetching cache requires a prediction table cache(PTC) built on fast and
expensive SRAM associated with prefetch controller. For a conventional cache of size 256KB, its
PTC is usually 32K to hold reference trace for each cache line in cache. But DRAM-based cache
does not require any extra memory except limited number of shift registers. Therefore, DRAM-based
cache reduces much hardware overhead compared with conventional prefetching cache.
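As a back-of-the-envelope check of this overhead, the following sketch plugs in the page and line sizes used later in the simulations (4096-byte pages, 128-byte lines, 64 KB of SRAM per bank); the variable names are ours.

```python
# Illustrative overhead calculation for the stride-prediction shift registers.
PAGE_SIZE = 4096      # bytes per DRAM cache page (from the simulated config)
LINE_SIZE = 128       # bytes per cache line

# Equation (1): shift-register length = number of cache lines per page.
shift_register_length = PAGE_SIZE // LINE_SIZE     # 32 entries

# Each SRAM-buffer row (one cache page) carries two such registers; with a
# 64 KB buffer per bank, only 16 pages are buffered per bank at a time.
rows_per_bank = (64 * 1024) // PAGE_SIZE           # 16 buffered pages
registers_per_bank = 2 * rows_per_bank             # 32 shift registers

print(shift_register_length, rows_per_bank, registers_per_bank)  # 32 16 32
```

This is tiny compared with a 32 KB SRAM prediction table cache, which is the point of the comparison above.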
5 Simulation Results
This section first presents the evaluation methodology and simulation environment, and then the
simulation results for the DRAM-based cache under different configurations and prefetching
schemes.
5.1 Methodology
To evaluate the performance of the DRAM-based cache, we built an event-driven cache simulator
on top of the detailed execution-driven out-of-order processor simulator in SimpleScalar. We also
model main-memory behavior precisely with event-driven simulation. The simulated processor is a
four-way-issue superscalar with speculative out-of-order execution. It has 32 register update units
(RUU) and a 16-entry load/store queue (LSQ). The processor has separate on-chip 16 KB
Parameter                                                               Cycles
SRAM latency in cache                                                        6
DRAM latency in cache                                                       12
latency to start a memory transaction                                       60
time to send row address                                                    10
latency between accepting the row address and accepting column address      20
latency between accepting the row address and sending back data             50
time to send all column addresses                                           80
time to send first chunk of data back                                       10
time to send data back                                                      80
time to precharge before accessing another page                             30
latency to finish a memory transaction                                      60
Table 3: Parameters of DRAM cache and Main Memory
direct-mapped data and instruction L1 caches with a cache line size of 32 bytes. The default
processor clock frequency is 1 GHz. The unified DRAM-based cache system is pipelined, so it can
handle one request per processor cycle. The data L1 cache uses a write-allocate, write-back policy
with LRU replacement. For main memory there are three kinds of transactions: read one cache line,
write one cache line, and read two cache lines in one DRAM page. Table 3 shows the parameters of
the DRAM-based cache and main memory. The round-trip memory latency is thus 190 cycles
(latency to start a memory transaction + time to send the row address + latency between accepting
the row address and sending back data + time to send the first chunk of data back + latency to
finish a memory transaction).
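The 190-cycle figure can be checked directly from Table 3; a quick sketch (the variable names are ours):

```python
# Recomputing the round-trip memory latency from Table 3's parameters.
start_transaction  = 60   # latency to start a memory transaction
send_row_address   = 10   # time to send row address
row_to_data        = 50   # latency between accepting row addr. and sending data
first_data_chunk   = 10   # time to send first chunk of data back
finish_transaction = 60   # latency to finish a memory transaction

roundtrip = (start_transaction + send_row_address + row_to_data
             + first_data_chunk + finish_transaction)
print(roundtrip)  # 190 cycles, matching the text
```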
The benchmarks we use are from the SPEC95 and SPEC2000 suites and include both integer and
floating-point programs. The baseline cache architecture is a conventional 256 KB four-way
associative L2 cache with 64-byte blocks. Since we want to measure the DRAM cache on the
high-miss-rate benchmarks in SPEC95 and SPEC2000, we ran simulations on all available
benchmarks and chose the ten with an L2 cache miss rate above 5% on the baseline cache. Figure 8
shows the hit rate of the selected integer and FP benchmarks, which are drawn from both SPEC95
and SPEC2000. Since integer benchmarks generally have lower miss rates than FP benchmarks,
only two of the ten chosen programs are integer benchmarks (mcf, twolf). The average L2 cache hit
rate over these ten benchmarks is 71%.
Figure 8: Selected Benchmarks
5.2 DRAM-based Cache Subsystem
When using cheap DRAM arrays as the back-end storage for cached data, we can build a very
large on-chip L2 cache. In the first set of experiments we explore the impact of a large back-end
store on system performance. We built an internal on-chip DRAM-based prefetching cache
simulator in which the DRAM arrays are modeled as a four-bank, four-way associative cache. The
DRAM arrays are distributed equally across the four banks. Each bank is associated with a fully
associative 64 KB SRAM buffer, so the total SRAM in the cache is 256 KB. The cache page size
is 4096 bytes and the cache line size is 128 bytes, so each cache page holds 32 cache lines. We
also implemented the page-based prediction and prefetching mechanisms for this
cache. Figure 9 shows the performance comparison between the conventional cache and the
DRAM-based cache. The prefetching results include both the stride prediction and the
hole-searching prediction; in practice there is little performance difference between the two
schemes, as described in the next subsection.
Figure 9: Selected Benchmarks
The IPC shown in the figure is the average over all selected benchmarks, normalized to the
baseline architecture. First of all, the figure shows that the DRAM-based cache improves
performance by about 120% with 1 MB, 2 MB, 4 MB and 8 MB DRAM arrays, and by more than
150% with 16 MB and 64 MB DRAM arrays; the average performance gain over all DRAM
cache sizes is 171%. The prefetching mechanisms increase system performance by another 17%
on average. We also see that growing the DRAM from 1 MB to 8 MB does not improve overall
performance much, for either the non-prefetching or the prefetching DRAM-based cache. When
the DRAM size reaches 16 MB, system performance increases by about 27% for the
non-prefetching cache and by about 18% for the prefetching cache. However, increasing the
DRAM further, from 16 MB to 64 MB, does not improve performance in proportion to the size
increase. It is therefore reasonable to conclude that DRAM-based caches with 1 MB
or 16 MB DRAM arrays and only a 256 KB SRAM buffer are cost-effective choices for most
benchmarks, depending on the expected performance gain. Figure 10 shows the performance
improvement of all selected benchmarks on the 16 MB DRAM cache with a 256 KB SRAM
buffer. On average the benchmark programs benefit from the DRAM-based cache and from
prefetching, but prefetching does not help two of the benchmarks, which can be explained by their
low prefetch accuracy. Figure 11 shows the prefetch accuracy of hole-searching prediction and
stride prediction for all benchmarks. Prefetch accuracy is the percentage of prefetched cache lines
that are actually referenced before being evicted from the SRAM buffer; only a prefetched cache
line referenced while in SRAM counts as an accurate prefetch. For the worst benchmark the
prefetch accuracy is extremely low, 1.7% for both prediction schemes, because its references
frequently cross pages and, given the limited size of the SRAM buffer, most referenced pages are
evicted before another reference falls into them. Low prediction accuracy can even hurt system
performance through cache pollution. For the other benchmarks, the difference in prefetch
accuracy between the two prediction schemes is within 5%.
5.3 Fast SRAM Buffer
We also explore the performance impact of the SRAM buffer size. Obviously, the larger the
SRAM buffer the DRAM-based cache uses, the more cache hits are caught by the SRAM buffer,
improving system performance since data in SRAM can be served immediately.
Figure 10: 16M DRAM w/ 256K SRAM
Figure 12 shows the delayed hit rate in a 1 MB four-bank, four-way associative DRAM-based
cache system with different SRAM buffer sizes. Here the delayed hit rate is the average over
all selected benchmarks used in the simulation. The cache page size is 4 KB and the cache line
size is 128 bytes. We simulate SRAM buffer sizes from 16 KB to 256 KB. At the low end the
SRAM buffer is only 16 KB, so each of the four banks of DRAM arrays has only 4 KB of SRAM
associated with it. This equals the cache page size, so in this case only one cache page is buffered
in SRAM at any time. In figure 12 the delayed hit rate is defined as

    DelayedHitRate = DelayedHits / (SRAMHits + DelayedHits)

where a delayed hit is a hit that must first transfer its cache page from the DRAM arrays to the
SRAM buffer.
From figure 12 we can see that with a 64 KB or larger SRAM buffer, the delayed hit rate drops
below 4%; in particular, it is only 3.31% for 64 KB of SRAM and decreases to 1.68% when the
size is increased to 256 KB. However, if the SRAM is smaller than 64 KB, the delayed hit rate
jumps to more than 10%. In the case of 32 KB SRAM, where the SRAM buffer associated with
Figure 11: Prefetching Accuracy
each bank can hold only two cache pages, the delayed hit rate is 11.9%, and if each bank of
DRAM has an SRAM buffer holding only one cache page, the delayed hit rate rises to 32.7%.
Thus a 64 KB SRAM buffer is enough to catch more than 96% of the cache hits from the L1
cache, and less than 4% of hits require a cache page transfer from the DRAM arrays to the SRAM
buffer. In the 64 KB case, each bank of DRAM arrays is associated with a 16 KB SRAM buffer
holding only four cache pages (assuming 4 KB cache pages), yet the DRAM-based cache with
this small 64 KB SRAM buffer works quite well. Figure 13 shows the performance comparison
across SRAM buffer sizes, and the impact is apparent: with a 64 KB SRAM buffer the IPC
normalized to the baseline architecture is 2.18, and it improves by less than 1%, to around 2.20,
with 256 KB of SRAM. On the other hand, performance drops noticeably when the SRAM buffer
is smaller than 64 KB, and there is only a trivial
Figure 12: Delayed Hit Rate
performance difference between the 32 KB and 16 KB SRAM buffers.
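The behavior measured above can be illustrated by modeling each bank's SRAM buffer as a small LRU cache of whole pages and counting how many hits must first transfer a page from the DRAM arrays. This toy model and its traces are ours, not the paper's simulator:

```python
# Minimal model of one bank's SRAM buffer: an LRU cache of whole pages.
# A reference whose page is buffered is an SRAM hit; otherwise it is a
# delayed hit (the page must be transferred from the DRAM arrays first).
from collections import OrderedDict

def delayed_hit_rate(page_trace, capacity_pages):
    buffer, sram_hits, delayed_hits = OrderedDict(), 0, 0
    for page in page_trace:
        if page in buffer:
            sram_hits += 1
            buffer.move_to_end(page)        # refresh LRU position
        else:
            delayed_hits += 1               # page transfer from DRAM arrays
            buffer[page] = True
            if len(buffer) > capacity_pages:
                buffer.popitem(last=False)  # evict least recently used page
    return delayed_hits / (sram_hits + delayed_hits)

# A tight loop over two pages fits in a 2-page buffer after two cold misses;
# a repeated scan over 8 pages thrashes a 4-page buffer completely.
print(delayed_hit_rate([0, 1, 0, 1, 0, 1], 2))   # 2/6
print(delayed_hit_rate(list(range(8)) * 2, 4))   # 1.0
```

The second trace mirrors the thrashing case described for the cross-page benchmark: when pages are evicted before being re-referenced, nearly every hit is delayed.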
5.4 Prediction Schemes Comparison
Because the history-based prediction schemes no longer apply in the DRAM-based cache
subsystem, we simulate two other prediction schemes: one aggressively searches for holes in the
cache page, and the other catches a constant stride between the past two consecutive references
and uses it to make the prediction. In the second scheme, if there is no such stride in the past
references, we fall back to the hole-searching scheme to generate a prefetch candidate. From
figures 9 and 10 it is easy to see that there is no apparent performance difference between these
two prediction schemes. Using the stride prediction scheme does not provide much improvement
over simply looking
Figure 13: Normalized IPC
for holes in the cache page. To explain this, we define the stride prediction rate as

    StridePredictionRate = StridePredictedCandidates / AllPrefetchCandidates    (2)
Figure 14 shows the stride prediction rate of all ten benchmarks for the 16 MB four-bank,
four-way associative DRAM-based cache with a 256 KB SRAM buffer. From figure 14 we can
see that on average stride prediction accounts for less than 1% of all prefetch candidates; even the
benchmark with the highest stride prediction rate reaches only 4.4%. The reason is that for each
cache reference, whether a hit or a miss, hole-searching prediction generates a prefetch candidate
whenever there is any hole in the cache page and no constant stride among the current and past
two references. This differs from a conventional cache, in which the one-block-lookahead scheme
probes only a limited number of times and does not issue a prefetch to the next level of memory
if the candidate is already in the cache, whereas the DRAM-based cache issues a prefetch
whenever there is any hole in the cache page. Since stride-predicted candidates account for less
than 1% of all candidates, it is not surprising that there is little performance difference between
the two prediction schemes.
Figure 14: Stride Prediction Rate
6 Conclusion and Future Work
With the widening speed gap between fast modern processors and the relatively slow memory
system, it is more important than ever to reduce the cache miss rate, given the long miss penalty.
Cost and energy constraints limit the use of large on-chip or off-chip cache subsystems. In this
project we propose a DRAM-based cache design that uses cheap, slow DRAM as cost-effective
storage for cached data. To speed up data access, we associate a small but fast SRAM buffer with
each bank of DRAM arrays. Our simulation results show that most data accesses can be served by
the SRAM buffer rather than the DRAM arrays. This cache subsystem presents the same interface
to both the processor and the next level of memory as any conventional cache. In addition, we
employ DRAM-page-based data prefetching between this cache and main memory to further
improve system performance; the prefetching schemes are simple and require very little extra
storage, so they are easy to integrate into the cache controller. Our simulation results indicate that
this design is practical and quite effective. As for future work, in this project we simulated only an
internal on-chip level-two DRAM-based cache, but the design can also be used in external cache
systems. The current benchmark suites show that 1 MB of cache is enough to provide a high hit
rate for most benchmarks; the performance improvement brought by prefetching and by even
larger DRAM-based caches, such as 16 MB or 64 MB, is therefore not apparent with the current
suites. We expect to run simulations on benchmarks with larger working sets than the current
programs to gain more insight into this design. We also want to explore inner prefetching within
the cache, that is, prefetching cache pages from the DRAM arrays to the SRAM buffers to reduce
misses in SRAM. Obviously, inner prefetching only makes sense when the SRAM miss rate is
high, which our current benchmarks do not exhibit in simulation.