
Performance Evaluation 49 (2002) 283–298

Shared cache architectures for decision support systems

Michel Dubois a,∗, Jaeheon Jeong b, Ashwini Nanda c

a Department of Electrical Engineering—Systems, University of Southern California, 3740 McClintock Avenue, Los Angeles, CA 90089-2562, USA
b IBM, Beaverton, OR 97006, USA
c IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Abstract

In this paper we evaluate two shared-cache architectures for small-scale multiprocessors. We vary shared cache sizes from 8MB to 1GB, under various block sizes, cache organizations and sizes, and strategies for IO transactions. We use 12 bus trace samples obtained during the execution of a 100GB TPC-H on an eight-way multiprocessor.

To deal with the cold-start misses at the beginning of each sample, we identify the sure misses, which are known to be misses in the full trace. The difference between the total number of misses and the number of sure misses is the zone of uncertainty, which may be hits or misses in the full trace. It turns out that the zone of uncertainty is small enough in most cases that useful conclusions can be drawn.

Our conclusions are that a single-cluster configuration with a shared cache—even a very small one—can be very effective for TPC-H. We also show that the coherence traffic between shared caches in a multiple cluster system is very high in the context of TPC-H.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Cache memory; Trace-driven simulation; Cold-start bias; TPC-H; IO strategy

1. Introduction

The design of large-scale servers must be optimized for commercial workloads and web-based applications. These servers are high-end, shared-memory multiprocessor systems with large memory hierarchies, whose performance is very workload dependent. Realistic commercial workloads are hard to model because of their complexity and their size.

Trace-driven simulation is a common approach to evaluate memory systems. Unfortunately, storing and uploading full traces for full-size commercial workloads is practically impossible because of the sheer size of the trace. To address this problem, several techniques for sampling traces and for utilizing trace samples have been proposed [2,4,5,7,9].

∗ Corresponding author. E-mail address: [email protected] (M. Dubois).

0166-5316/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0166-5316(02)00135-9


Fig. 1. Target multiprocessor systems with shared cache: (a) single-cluster system; (b) two-cluster system.

For this study, we have collected time samples of bus transactions in an actual multiprocessor machine running a 100GB TPC-H [8] using the Memory Instrumentation and Emulation System (MemorIES) board developed by IBM [6]. MemorIES was originally designed to emulate memory hierarchies in real time and is plugged into the system bus of an IBM RS/6000 S7A SMP system running DB2 under AIX. This system can run database workloads with up to 1TB of database data.

The two target systems that we evaluate with the TPC-H time samples are shown in Fig. 1. They are both eight-processor bus-based machines with 24GB of main memory. Each processor has 8MB of second-level cache (four-way, 128-byte blocks). Cache coherence is maintained by snooping on a high-speed bus implementing an invalidation-based MESI protocol.

The single-cluster system includes a shared cache [11], whose goal is to cut down on the effective latency of the large and slow DRAM memory. The shared cache benefits from sharing, prefetching, and spatial locality of L2 cache misses, so that even very small shared caches can be very effective. The disadvantages are that L2-miss penalties are only partially reduced and that the shared cache does not bring any relief to bus traffic. We vary the shared cache sizes from 8MB to 1GB.

In some environments, packaging or bus cycle time constraints may favor a multiple cluster system. Therefore, the second system we evaluate is a two-cluster system. Each cluster is made of four processors and has a shared cache. The two shared caches are maintained coherent through a second-level snoopy bus implementing the same protocol as the first-level buses. We vary shared cache sizes from 32 to 512MB so that the aggregate amount of shared cache varies from 64MB to 1GB.

We look at the effects of cache sizes, cache block sizes and cache organizations in both systems. Moreover, since IO transactions are a significant fraction of all bus transactions, we evaluate various strategies for handling IO bus references in the shared caches—IO-invalidate, IO-update and IO-allocate.

The work presented in this paper is unique because the evaluations are done with actual data obtained from a realistic TPC-H database size. Previous results were based on simulations and drastically reduced database sizes, and relied on arguments to scale down the caches and the workload.


The rest of this paper is organized as follows. In Section 2, we present the tracing tool obtained by configuring MemorIES, the trace sampling scheme, and the characteristics of the trace samples. In Section 3, we describe the target shared-cache architectures. In Sections 4 and 5, we show the impact of the cold-start effect and classify misses into cold, coherence and replacement misses. We also identify the set of sure misses and, based on this classification, we are able to draw some conclusions on the effectiveness of the two shared cache architectures in the context of TPC-H. Section 6 examines the variation across the trace samples. Section 7 reviews related work. Finally, we conclude in Section 8.

2. TPC-H traces

In this section we explain briefly how the time samples of TPC-H were obtained and show the properties of the samples across processors. More details can be found in [3,6].

2.1. Tracing environment

The IBM MemorIES was designed to evaluate trade-offs for future memory system designs in multiprocessor servers. The MemorIES board sits on the bus of a conventional SMP and passively monitors all the bus transactions, as shown in Fig. 2.

The host machine is an IBM RS/6000 S7A SMP server, a bus-based shared-memory multiprocessor. The server configuration consists of eight Northstar processors running at 262 MHz and a number of IO processors connected to the 6xx bus operating at 88 MHz. Each processor has a private 8MB four-way associative L2 cache. The cache block size is 128 bytes. The size of main memory is 24GB.

A finely tuned 100GB TPC-H benchmark [8] runs on top of DB2 under AIX. Its execution takes 3 days on the server. Although MemorIES has been conceived for online cache emulation, we have configured it as a tracing tool. In this mode, MemorIES captures bus transactions in real time and stores them into its on-board memory. Later on, as the on-board memory fills up, the trace is uploaded to a disk. With its 1GB of SDRAM, MemorIES can collect up to 128M bus transaction records without any interruption, since each trace record occupies 8 bytes.

Fig. 2. Operating environment of MemorIES.


Fig. 3. Time sample collection.

2.2. Trace samples

Although MemorIES can store up to 128M references per sample, the size of each sample collected for this study is 64M references. Currently, with the limited IO capability of the board, it takes about 2 h to upload a sample of 64M references (512MB) to disk.

Our trace consists of 12 samples with roughly 2 h between samples, and its overall size is 6GB. The trace samples were taken during the first day of a 3-day execution of a 100GB TPC-H. Each sample records up to 10 min of execution. The collection of samples is illustrated in Fig. 3.
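These figures are mutually consistent with the 8-byte trace record size quoted above:

\[
\frac{1\,\mathrm{GB}}{8\,\mathrm{B/record}} = 128\,\mathrm{M\ records},\qquad
64\,\mathrm{M} \times 8\,\mathrm{B} = 512\,\mathrm{MB\ per\ sample},\qquad
12 \times 512\,\mathrm{MB} = 6\,\mathrm{GB}.
\]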

Table 1 shows the bus reference counts of every processor and for every significant bus reference type in the first trace sample. The significant bus references are: reads, writes (read-excl, write/inv, upgrades), and write/flushes (essentially due to IO-write). Table 1 shows that the variations of the reference count among processors (excluding IO processors) are very small. We found similar distributions in the rest of the samples. In the first sample, processor reads and writes (including upgrades) contribute around 50 and 25% of all bus transactions, respectively. The references by IO processors are dominated by write/flush and make up 25% of bus references.

Table 1
Breakdown of significant transaction counts in the first trace sample across processors

PID    Read       Read-excl   Write/inv   Upgrade     Write/flush
0      3944601    221999      295218      1629223     67102
1      4001991    217725      270918      1685160     64724
2      4175661    208862      303913      1742245     65406
3      3908973    213590      286916      1610544     66424
4      4101762    209785      254847      1640377     66063
5      3932938    217326      314963      1580184     65686
6      4166149    192101      248305      1821616     65143
7      3851738    200694      225611      1654915     64449
IO     978887     0           12          0           16232121

Total  33062700   1682082     2200703     13364264    16757118
Percentage  49.3  2.5  3.3  19.9  25

IO-writes are caused by input to memory. On each IO-write, the data block is transmitted on the bus and the processors' write-back L2 caches are invalidated to maintain coherence with IO. Since this input data will most likely be needed by some processor(s) at a later time, the invalidation by IO will cause a bus miss transaction by one of the processors later on. This effect is not critical when the caches are small enough so that they cannot retain the data between the time of an IO-write and the time of the processor access. However, the write/flushes due to IO may be quite costly in the context of very large caches (such as shared L3 caches) which are able to retain data blocks for long periods of time. This observation has prompted us to look at other strategies for handling IO transactions in the shared caches, besides IO-invalidate. We also consider updating or allocating the block in the shared caches on an IO-write.

3. Target shared cache architectures

We focus on the evaluation of two L3 shared-cache architectures in an eight-processor SMP, both shown in Fig. 1. Fig. 1(a) shows a single-cluster system with a single shared cache. There is no need to maintain inclusion between the shared cache and the processor L2 caches, as the cache is shared by all processors. The idea of a shared cache is not new as it was proposed and evaluated in [11], in a different environment and using analytical models. The goal of a shared cache is to cut down on the latency of DRAM accesses to main memory. On every bus cycle, the shared cache is looked up and the DRAM memory is accessed in parallel. If the shared cache hits, then the memory access is aborted. Otherwise, the memory access completes and the shared cache is updated. Let H be the cache hit rate, and let L_cache and L_DRAM be the latencies of a shared cache hit and of a DRAM access. Then the apparent latency of the large DRAM memory is (1 − H) · L_DRAM + H · L_cache. The shared cache can have a high hit ratio even if its size is less than the aggregate L2 cache size in the multiprocessor, because of sharing, prefetching, and spatial locality effects. If the main (DRAM) memory is banked and interleaved, then a separate shared cache bank can be used in conjunction with each memory bank. This approach effectively distributes the miss traffic between shared cache and main memory across several short unshared busses. In this single-cluster configuration we vary the shared cache size from 8MB to 1GB. As we will see, the 8MB shared cache is quite effective, even though the aggregate size of all L2 caches is 64MB.
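As a quick illustration of this expression, the short sketch below computes the apparent latency for a few hit ratios; the latency values are hypothetical placeholders, not measurements from the systems studied here.

```python
def effective_latency(hit_rate, l_cache, l_dram):
    """Apparent memory latency behind a shared cache: (1 - H)*L_DRAM + H*L_cache."""
    return (1.0 - hit_rate) * l_dram + hit_rate * l_cache

# Hypothetical latencies in bus cycles (illustrative only).
L_CACHE, L_DRAM = 20, 100
for h in (0.0, 0.5, 0.75, 0.9):
    print(f"H = {h:.2f}: apparent latency = {effective_latency(h, L_CACHE, L_DRAM):.1f} cycles")
```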

Fig. 1(b) shows a two-cluster system with shared caches. Each cluster has four processors. This partitioning may be dictated, for example, by packaging constraints or bus speed constraints. Coherence between the two shared caches is maintained by a snooping protocol on a second-level bus implementing the same MESI protocol as the first-level bus. Inclusion between the L2 caches and the shared cache simplifies the design greatly and reduces the traffic on the first-level busses, and so we assume inclusion in this case. We will vary the size of each shared cache between 32 and 512MB so that its size is at least the aggregate size of the L2 caches connected to it. Whenever we compare a single-cluster system to a two-cluster system, we always use the same aggregate amount of shared caches. So a two-cluster system with 512MB of shared cache in each cluster is compared with a single-cluster system with 1GB of shared cache. The major issue with the two-cluster system is the coherence traffic on the second-level bus. Contrary to the single-cluster system in which memory can be organized in separate memory banks, all the traffic in the two-cluster system between shared caches and main memory must be supported by the second-level bus.

IO transactions can have a huge impact on shared-cache performance. The usual way to deal with IO transactions is to simply invalidate the shared caches on each IO transaction detected on the bus. However, other policies are possible, since an input transaction carries the data with it. The second policy is IO-update, whereby, if the shared cache hits, then the block is updated instead of invalidated. The third policy is IO-allocate, whereby misses due to IO cache accesses allocate a block in the cache. Because allocating means taking a miss in the shared cache, the control for the cache may be more complex, especially when the IO transaction does not carry the values for the entire block. These two strategies may pay off if the shared cache is big enough to retain a database record between the time it is input and the time it is accessed.

In each shared cache system we simulate, we treat each bus reference as follows:

• Processor requests:
◦ Read: On a hit in the shared cache, we update the block LRU statistics. On a miss the data is loaded in the cache.
◦ Read-excl, write/inv, upgrade: Same as read. In the two-cluster system the other shared cache is invalidated.
◦ Write/flush: We invalidate the block in all shared caches.
• IO requests: There are three types of IO requests: IO read, IO-write and IO-write/flush. By far write/flushes are the dominant request type. Different policies are evaluated for IO-write/flushes (IO reads and writes always invalidate the shared cache). In IO-invalidate, the shared cache(s) is (are) invalidated on all IO references. In IO-update, the block is updated in the case of a write hit and its LRU statistics are updated. In IO-allocate, IO-write/flushes are treated as if they were coming from execution processors.

The miss rates reported in this paper do not include misses due to IO since they are not on the critical path of processors' execution.
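To make these rules concrete, here is a minimal sketch of how a single shared cache could process such a bus reference stream under the three IO policies. It is an illustration only (LRU replacement, no second cluster, all names hypothetical), not the simulator actually used in this study.

```python
from collections import OrderedDict

class SharedCache:
    """Set-associative shared cache with LRU, sketching the rules above."""

    def __init__(self, size_bytes, block_bytes, assoc):
        self.block, self.assoc = block_bytes, assoc
        self.nsets = size_bytes // (block_bytes * assoc)
        self.sets = [OrderedDict() for _ in range(self.nsets)]  # per-set LRU order
        self.misses = 0

    def _set_of(self, addr):
        blk = addr // self.block
        return self.sets[blk % self.nsets], blk

    def _touch_or_fill(self, s, blk, count_miss=True):
        if blk in s:
            s.move_to_end(blk)            # hit: update LRU statistics
            return
        if count_miss:
            self.misses += 1              # IO-induced fills are not counted (see above)
        if len(s) >= self.assoc:
            s.popitem(last=False)         # evict the LRU block
        s[blk] = True

    def processor_access(self, addr, kind):
        s, blk = self._set_of(addr)
        if kind == "write_flush":
            s.pop(blk, None)              # processor write/flush: invalidate
        else:                             # read, read-excl, write/inv, upgrade
            self._touch_or_fill(s, blk)

    def io_write_flush(self, addr, policy):
        s, blk = self._set_of(addr)
        if policy == "invalidate":
            s.pop(blk, None)
        elif policy == "update":
            if blk in s:                  # update (and touch) only on a hit
                s.move_to_end(blk)
        elif policy == "allocate":        # treat like a processor reference
            self._touch_or_fill(s, blk, count_miss=False)

# Example: 8MB shared cache, 128-byte blocks, four-way, IO-allocate policy.
cache = SharedCache(8 << 20, 128, 4)
cache.processor_access(0x1000, "read")
cache.io_write_flush(0x2000, "allocate")
print(cache.misses)
```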

4. Shared cache warm-up rate

Fig. 4. Shared cache warm-up rate (128-byte blocks).

Fig. 4 shows the average fraction of warm blockframes as a function of the reference number across all 12 samples for the two clustering schemes and a block size of 128 bytes. We simulate each trace sample starting with empty caches. A blockframe is deemed warm after a block has been allocated to it for the first time. We always allocate an invalidated blockframe first. For every number of references we calculate the average number of warm blockframes across all samples. The aggregate shared cache size in both systems varies from 64MB to 1GB. From the graphs, we see that shared caches with an aggregate size larger than 128MB are never completely warm with 64M references, although the caches in the two-cluster system warm up a bit faster than in the single-cluster system. Shared caches with larger block sizes fill up faster, but we still observe a large error due to the cold-start effect in each sample for the larger cache sizes.
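A simple way to reproduce such a warm-up curve from one sample is sketched below; it tracks, per set, how many blockframes have been filled at least once (invalidations are ignored here for brevity, and all names are hypothetical).

```python
from collections import OrderedDict

def warmup_curve(block_ids, nsets, assoc):
    """Fraction of warm blockframes after each reference of one trace sample,
    starting from an empty cache. A frame is warm once a block has been
    allocated to it for the first time."""
    sets = [OrderedDict() for _ in range(nsets)]
    warm = [0] * nsets                    # warm blockframes in each set (<= assoc)
    warm_total, total_frames, curve = 0, nsets * assoc, []
    for blk in block_ids:                 # block_ids = address // block_size
        i = blk % nsets
        s = sets[i]
        if blk in s:
            s.move_to_end(blk)            # hit: LRU update only
        else:
            if warm[i] < assoc:           # filling a never-used frame warms it
                warm[i] += 1
                warm_total += 1
            if len(s) >= assoc:
                s.popitem(last=False)     # set already full: reuse a warm frame
            s[blk] = True
        curve.append(warm_total / total_frames)
    return curve
```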

5. Shared-cache miss rates

To evaluate the performance of shared cache architectures with the trace samples, we use the miss rate as a metric. Misses are counted in each sample with a cold cache at the beginning of each sample. Since the number of references is the same in every sample, we can simply divide the total number of misses observed in all time samples by the total number of bus references in the samples.

5.1. Miss rate classification

We decompose misses in the shared caches into the following four categories: cold misses, IO-Coh misses (misses due to IO-invalidation—these misses are only present when IO transactions cause cache invalidations), P-Coh misses (misses due to intercluster invalidations—these misses are only present in the two-cluster systems), and replacement misses.

To understand the cold-start bias better, cold misses are further classified as cold–cold and cold–warm misses according to whether the cache set is warm (full) at the time of the cold miss. Any miss on a first access to a block in the sample which is preceded in the sample by an invalidation from IO or by a write by a processor in the other cluster is classified as an IO-Coh or a P-Coh miss, respectively. Therefore, to be classified as a cold miss, a bus reference must be the first access to the block in the sample and in a cluster, and moreover, it cannot be preceded in the same sample by an IO-invalidation or by a write by a processor in the other cluster. The classification algorithm is shown in Fig. 5.

The reason for separating cold misses in warm sets is that cold–cold misses in the samples may be hits in the full trace. For this reason, cold–cold misses have been called unknown references in [10]. On the other hand, cold–warm misses are sure misses because they are known to miss in the full trace for sure, although their classification in the full trace is unknown. Like cold–warm misses, P-Coh, IO-Coh, and replacement misses are sure misses, but, additionally, they also have the same classification in the full trace as in the samples.
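The following sketch shows one way the classification of Fig. 5 could be implemented for a single cluster's shared cache; the trace encoding, names and LRU organization are hypothetical simplifications of the actual algorithm.

```python
from collections import OrderedDict

def classify_misses(refs, nsets, assoc):
    """Classify shared-cache misses in one sample (cold cache at the start).
    refs is a list of (source, block) pairs, where source is 'local' for a
    processor reference in this cluster, 'remote' for a write by a processor
    in the other cluster, and 'io' for an IO-write (IO-invalidate policy)."""
    sets = [OrderedDict() for _ in range(nsets)]
    seen, io_inval, p_inval = set(), set(), set()
    counts = {"cold-cold": 0, "cold-warm": 0, "IO-Coh": 0, "P-Coh": 0, "replacement": 0}

    for source, blk in refs:
        s = sets[blk % nsets]
        if source == "io":                 # IO-invalidation precedes a later local miss
            s.pop(blk, None)
            io_inval.add(blk)
            continue
        if source == "remote":             # write in the other cluster invalidates here
            s.pop(blk, None)
            p_inval.add(blk)
            continue
        if blk in s:                       # local reference hits: just update LRU
            s.move_to_end(blk)
        else:                              # local miss: classify, then allocate
            if blk in io_inval:
                counts["IO-Coh"] += 1
            elif blk in p_inval:
                counts["P-Coh"] += 1
            elif blk in seen:
                counts["replacement"] += 1
            elif len(s) >= assoc:
                counts["cold-warm"] += 1   # first access, set already full (warm)
            else:
                counts["cold-cold"] += 1   # first access, set not yet full
            if len(s) >= assoc:
                s.popitem(last=False)
            s[blk] = True
        seen.add(blk)
        io_inval.discard(blk)
        p_inval.discard(blk)
    return counts
```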

We also count Upgrade misses. Upgrade misses are not part of the classification of Fig. 5. Upgrade misses are any kind of misses in the shared caches caused by upgrade bus references. Upgrade misses do not occur in L3 caches when cache inclusion is maintained between the processors' L2 caches and the shared (L3) caches. With our bus traces, however, we cannot maintain L2/L3 inclusion and Upgrade misses can occur in the L3 cache. Therefore, strictly speaking, the simulation results presented in this paper can be considered correct only if L2/L3 inclusion is not maintained. However, in all the evaluations we have made, we have observed that Upgrade misses are very few as compared to other misses. If we were to show the contribution of Upgrade misses to shared cache misses in the plots of Fig. 6, they would hardly show in all cases. This shows that the effects of non-inclusion between L2 caches and L3 shared caches are negligible. This observation has two implications:

• We do not expect much of a boost to the miss rate of the shared cache because of non-inclusion.
• The results for the two-cluster system, in which inclusion is required, are approximate, but the error due to this approximation is very small.

Fig. 5. Flowchart of the miss classification algorithm.

Fig. 6 shows the classification of misses in the shared cache of the single-cluster and of the two-cluster systems for cache blocks of 1KB and four-way set-associative caches, and for each of the three policies for handling IO.

In the single-cluster system (Fig. 6(a)) smaller caches are dominated, as expected, by a large number of replacement misses. Remember that we do not need inclusion in this case. For small cache sizes the impact of invalidations from IO is very small and the policy adopted for IO references does not matter much.

However, for very large caches, misses due to IO coherence dominate in the system with IO-invalidate. When the shared cache reaches 1GB, the IO-allocate strategy is very effective as practically all the misses due to IO are eliminated. Finally, even a small 8MB shared cache is quite effective, with a miss rate below 25% for a block size of 1KB. Note that the main memory and the shared cache can be banked and interleaved so that the traffic between shared cache and memory can be spread over several short non-shared busses.

Fig. 6. Miss rate classifications: (a) single-cluster system; (b) two-cluster system.

Fig. 6(b) shows the miss classification for the two-cluster system. The miss rate is dominated here by invalidation misses due to intercluster coherence (P-Coh misses). This component does not decrease with the cache size, and thus larger caches are not very effective at cutting the overall miss rate. Because of the dominance of intercluster invalidation misses, the miss rate component due to IO remains relatively small, even for 1GB caches, and therefore IO-update and IO-allocate policies are not very effective. With a block size of 1KB and a miss rate of 15%, the miss traffic on the second-level bus is significantly higher than the total traffic on both first-level busses, where the block size is 128 bytes. If a choice must be made, we should pick the smallest block size, i.e. 128 bytes, and the smallest cache size (i.e. 64MB) since, in this case, the miss rate in the shared caches is less than 25%, and therefore the traffic on the second-level bus is much less than on the two first-level busses.
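A rough per-reference estimate makes this comparison concrete. Assuming each first-level bus reference moves at most one 128-byte block, that the second-level bus moves one shared-cache block per miss, and ignoring address and control overhead, the data traffic per bus reference is approximately the miss rate times the block size:

\[
0.15 \times 1024\,\mathrm{B} \approx 154\,\mathrm{B} \;>\; 128\,\mathrm{B},
\qquad\text{whereas}\qquad
0.25 \times 128\,\mathrm{B} = 32\,\mathrm{B} \;\ll\; 128\,\mathrm{B}.
\]

So 1KB blocks at a 15% miss rate exceed the first-level traffic, while 128-byte blocks at a 25% miss rate stay well below it.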

Unfortunately, as the cache size increases, cold–cold misses become a dominant part of the miss rate, and the large number of cold–cold misses in very large caches means that the error due to cold start is significant in these systems, and we need to visualize this uncertainty.


5.2. Sure misses

An upper bound on the miss rate in the samples without cold-start bias is the total miss rate, i.e. the ratio of the number of all misses in the samples starting with cold caches to the number of bus references in the samples. This upper bound is displayed in Fig. 6. The sure miss rate of the samples is a lower bound on the miss rate of the samples without cold-start bias. It is the ratio of the number of sure misses in the samples starting with cold caches to the number of bus references in the samples. The sure misses are the cold–warm misses, the coherence misses, and the replacement misses. They exclude cold–cold misses. Cold–cold misses were deemed unknown references in [4], because they may be hits or misses in the full trace. If the distribution between misses and hits in the unknown references were the same as in the full trace, then the sure miss rate would be equal to the miss rate of the samples in the full trace. However, renewal theory arguments as well as empirical evidence show that the fraction of misses in the unknown references is much higher in the full trace than the fraction of sure misses in the samples [4]. In this work we take a very conservative approach, and we simply consider the sure miss rate as a lower bound on the miss rate of the samples in the full trace. The zone of uncertainty is the difference between the total miss rate and the sure miss rate.
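In code form, these bounds could be computed from the classified miss counts of Section 5.1 as follows (a trivial sketch; the counts used in the example call are illustrative only):

```python
def miss_rate_bounds(cold_cold, cold_warm, coherence, replacement, bus_refs):
    """Lower bound (sure miss rate), upper bound (total miss rate) and the
    zone of uncertainty for a set of samples, per the definitions above."""
    total = (cold_cold + cold_warm + coherence + replacement) / bus_refs
    sure = (cold_warm + coherence + replacement) / bus_refs
    return sure, total, total - sure

# Fig. 7's black line plots the midpoint of these two bounds.
sure, total, zone = miss_rate_bounds(1e6, 2e6, 3e6, 4e6, 64e6)
midpoint = (sure + total) / 2
```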

In the IO-invalidate case, the sure miss rate can be decomposed into the sure miss rate due to IO and the other sure miss rate. This is shown in Fig. 7, which displays the effect of cache block sizes in various systems. The white component at the top of each stack bar is the zone of uncertainty. It shows the error range due to the cold-start bias in each sample.

Fig. 7. Effect of block size: (a) single-cluster system; (b) two-cluster system.

The black line superimposed on the stack bar charts in Fig. 7 shows the mean between the total miss rate and the sure miss rate. The zone of uncertainty is small whenever the cache is small and/or when the block size is large. Obviously, for the single-cluster system (Fig. 7(a)), there is a large amount of spatial locality in the shared cache, and so the performance is better for larger block sizes. The best block size is 4KB, which is also the page size. Since the shared cache is physically addressed, block sizes greater than the page size cannot capture any additional spatial locality. In all cases, the miss rate is minimized and is less than 20% for a block size of 4KB. In the case of a 1GB shared cache, the misses due to IO dominate and the IO-allocate strategy is effective at eliminating the misses due to IO; however, for the best block sizes (between 1 and 4KB) the miss rates are already very small, less than 5%, and so the overall impact on processor performance may be low.

The miss rate in the two-cluster system is dominated by intercluster coherence. This component does not change much with the block size, which shows that there is little false sharing between clusters. Increasing the cache size is futile, and a system with an aggregate shared cache of 64MB looks like the best choice.

Fig. 8 focuses on the impact of the strategy for IO bus references in the single-cluster system. When the shared cache size is small (8 or 64MB) the strategy used for IO bus requests does not make much difference, for both block sizes of 128 bytes or 1KB. By contrast, in a 1GB cache, we see that the strategy applied to IO requests might have a large impact, although the data is not conclusive in this case as the zone of uncertainty is large. In the case of 1GB caches with 1KB block size the miss rate is so small (a few percent) that the effect on processor performance may be insignificant.

Fig. 8. Miss rates comparison between various cache strategies for IO references in the single-cluster system.

Fig. 9. Effect of cache organization and block size in the single-cluster system.

Finally, Fig. 9 shows the effect of the block size and the cache organization in the single-cluster system. We see again that the block size has a huge impact on the miss rate but that the organization of the cache does not matter much. Obviously conflict misses are few and far between. This is a reliable conclusion given the small size of the zones of uncertainty in all cases.

6. Variation across samples

One important question is whether the 12 time samples are representative of the whole execution. We do not have a scientific answer to this question. However, we can look at the differences between the samples taken at 2 h intervals to see whether there are wide variations during the first day of execution of TPC-H.

Table 2 shows the reference counts of the significant transaction types in each of the 12 trace samples. In the 12 samples, processor reads and writes (including upgrades) contribute around 44 and 21% of all bus transactions, respectively. Thirty-five percent of all bus transactions are due to IO-writes.

We find some noticeable variations for some reference counts across trace samples. The sixth sample seems to be very different from the other samples: it has a lot more write transactions (read-excl and write/inv) and far fewer upgrades than the rest of the samples. Otherwise, event counts are quite uniform across samples. Fig. 10 shows the variation of the total miss rate across the 12 trace samples for the two clustering schemes and various aggregate cache sizes. It tends to confirm that sample six is an outlier. However, even after removing sample six from the evaluations in this paper, we have verified that the observations made in this paper are not really affected.

Fig. 10. Variation of miss rate across the 12 trace samples (128-byte block).


Table 2Reference count per significant bus transaction type in the 12 trace samples

Sample Read Read-excl Write/inv Upgrade Write/flush

1    33062700   1682082   2200703   13364264   16757118
2    29574629   1446244   2677942   10059975   23304021
3    28717853   1335718   2835185   10637496   23547308
4    27727985   1405457   2900043   8734182    26273638
5    25799000   1774029   3067320   9517591    26899585
6    28760710   4021657   6077446   6367359    21837353
7    29087095   1959379   2953071   7265500    25785277
8    28830726   1929086   2866096   7312608    26111116
9    26092006   1970931   3206843   7186243    28594783
10   26899985   1768684   3279821   7148337    27961270
11   36070868   1079535   2009423   10462970   17390020
12   36516311   1108966   2159616   10631200   16445111

Average (%)   44.35   2.66   4.50   13.50   34.88

7. Related work

Work related to this paper falls in two categories: trace sampling and evaluation of commercial workloads.

Puzak [7] introduced trace stripping and set sampling. Trace stripping is based on a filter cache and the references that miss in the filter cache are collected to accurately evaluate cache miss rates. Puzak also showed that trace sampling using one-tenth of cache sets leads to a reliable estimation of miss rates. Wang and Baer [9] extended Puzak's techniques using filter caches and proposed methods to evaluate miss count, write-back count and invalidation count. Their method keeps in the trace all writes on clean blocks in addition to the references that miss in the filter cache. Chame and Dubois [2] proposed processor sampling techniques based on cache inclusion for large-scale multiprocessor systems. They examined whether cache inclusion can be maintained for different set mapping functions and showed that the traces of references to a small number of sets can be expanded for a larger number of sets if caches are evaluated using stack algorithms.

Kessler et al. [4] compared a variety of trace sampling techniques based on set sampling and time sampling under the 10% sampling rule. They showed that the cold-start bias is the main reason behind the inaccuracy of time sampling, and that set sampling outperforms time sampling.

Barroso et al. [1] studied the evaluation of commercial workloads using SimOS. They addressed the difficulty of evaluating commercial workloads and presented results for TPC-D, whose behavior resembles the behavior of TPC-H. Due to their simulation approach, the database was scaled down to 12GB. They found that the numbers of cold misses, coherence misses and replacement misses are comparable in the context of systems with 2MB board-level caches.

We do not know of any evaluation of large cache behavior using a realistic TPC-H database size, such as the one used in this paper. Previous results were based on simulations and drastically reduced database sizes, and relied on arguments to scale down the caches and the workloads.

8. Conclusion

In this work, we have programmed the IBM MemorIES board to take time samples of a 100GB TPC-H running on a multiprocessor. Twelve time samples of 64M bus references each were collected by monitoring the memory bus of an eight-processor IBM RS/6000 S7A SMP server running DB2 under AIX during the first day of execution of TPC-H. Our TPC-H bus trace is mainly composed of reads, writes, upgrades and IO-writes. Each of these access types makes up around 44, 7, 14 and 35% of all references, respectively. The trace samples have been used to evaluate shared L3 cache architectures in a system with eight processors.

One of the major observations from the twelve 64M-reference trace samples is the dominance of intercluster coherence misses, which may cause a huge amount of memory bus traffic in the two-cluster system. At this point, we have no idea how to reduce the number of coherence misses, although we are trying to understand their nature and to eliminate some of them. Small block sizes (e.g. 128 bytes) and small cache sizes (e.g. 64MB) are preferable because the miss rate does not improve much with the cache size and because small blocks cause much less traffic on the memory bus. Unless the design decision for an L3 cache architecture is based on other factors besides the miss rate under TPC-H—such as other workloads or packaging constraints—the single-cluster system is by far the best choice.

In systems with one cluster and larger caches, we must bring down the number of misses due to IO coherence. To do this we considered updating and allocating in the shared cache instead of invalidating it on an IO-write on the bus.

We have observed the following in the one-cluster system:

• Even small shared caches of 8MB (much less than the aggregate L2 cache size) are very effective, due to sharing, prefetching, and spatial locality effects.

• Whether inclusion is enforced or not has little impact on the shared cache miss rate.
• With a 1GB cache and a block size of 1KB the shared cache miss rate is only a few percent.
• A block size of 1KB is much better than a block size of 128 bytes. The optimum for the miss rate is 4KB, the page size.
• The cache organization does not affect performance much. Thus a simple direct-mapped cache is probably preferable. This indicates few conflicts in the shared cache.


• For large caches (e.g., 1GB) the strategy for handling IO requests (IO-invalidate, IO-update or IO-allocate) may have a large impact on the miss rate. However, the error due to the sampling strategy makes this result unreliable. Moreover, the miss rate is so small in 1GB caches that the net effect of the IO strategy on processor efficiency may be negligible.

The accuracy of time sampling is affected by two factors: the sampling rate and the cold-start effect in each sample. To improve the sampling rate we simply have to take more time samples. However, we have seen that there was little variation in the behavior of the samples taken across an entire day of TPC-H execution.

We have seen that the trace samples with 64M references are not enough to warm up L3 cache systems with an aggregate amount of cache larger than 128MB. Cold-start effects in each sample result in a zone of uncertainty due to cold–cold misses, which could be hits or misses in the full trace. The zone of uncertainty is the difference between the total miss rate and the sure miss rate of the samples. The sure miss rate is based on cold–warm misses, coherence misses and replacement misses.

We are currently trying to narrow the zone of uncertainty to get more conclusive evidence. Since the MemorIES board can emulate target cache systems in real time, we can use the time between samples to emulate a set of caches with different architectures and fill these caches up before the next sample is taken so that we have the content of these caches at the beginning of each sample (called a cache snapshot). The content of these emulated caches is dumped with the sample at the end of each time sample. By playing with cache inclusion properties [9], the content of these few emulated caches can be used to restore the state of many different cache configurations at the beginning of each sample, thus eliminating the cold-start effect. In this framework, a trace collection experiment consists of several phases, repeated for each time sample: a phase in which we emulate the target caches to get the snapshots, the trace sample collection phase, and the trace dump phase in which the cache snapshots and the trace samples are dumped to disk. Once we have the trace with the cache snapshots we will be able to firm up the conclusions of this paper for the very large cache sizes. However, we do not feel that the conclusions of this paper will change.

We are also currently taking samples for other commercial workloads, including TPC-C. Once we have samples for various workloads we will be able to draw more general conclusions about the usefulness and design of L3 cache architectures for future servers.

Acknowledgements

This work is supported by NSF Grant CCR-0105761 and by an IBM Faculty Partnership Award. We are grateful to Ramendra Sahoo and Krishnan Sugavanam from IBM, Yorktown Heights, who helped us obtain the trace. We also thank Shahin Jahromi and Mahsa Rouhanizadeh for helping with the preparation of the paper.

References

[1] L. Barroso, K. Gharachorloo, E. Bugnion, Memory system characterization of commercial workloads, in: Proceedings of the 25th ACM International Symposium on Computer Architecture, June 1998.

[2] J. Chame, M. Dubois, Cache inclusion and processor sampling in multiprocessor simulations, in: Proceedings of the ACM Sigmetrics, May 1993, pp. 36–47.


[3] J. Jeong, R. Sahoo, K. Sugavanam, A. Nanda, M. Dubois, Evaluation of TPC-H bus trace samples obtained with MemorIES, in: Proceedings of the Workshop on Memory Performance Issues, ISCA 2001, http://www.ece.neu.edu/conf/wmpi2001/full.htm.

[4] R. Kessler, M. Hill, D. Wood, A comparison of trace-sampling techniques for multi-megabyte caches, IEEE Trans. Comput. 43 (6) (1994) 664–675.

[5] S. Laha, J. Patel, R.K. Iyer, Accurate low-cost methods for performance evaluation of cache memory systems, IEEE Trans. Comput. 37 (11) (1988) 1325–1336.

[6] A. Nanda, K. Mak, K. Sugavanam, R. Sahoo, V. Soundararajan, T. Basil Smith, MemorIES: a programmable, real-time hardware emulation tool for multiprocessor server design, in: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.

[7] T.R. Puzak, Analysis of cache replacement-algorithms, Ph.D. Dissertation, University of Massachusetts, Amherst, MA, February 1985.

[8] Transaction Processing Performance Council, TPC Benchmark H Standard Specification, June 1999, http://www.tpc.org.

[9] W. Wang, J.L. Baer, Efficient trace-driven simulation methods for cache performance analysis, ACM Trans. Comput. Syst. 9 (3) (1991) 222–241.

[10] D. Wood, M. Hill, R. Kessler, A model for estimating trace-sample miss ratios, in: Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 1990, pp. 27–36.

[11] P. Yeh, J. Patel, E. Davidson, Performance of shared cache for parallel-pipelined computer systems, in: Proceedings of the 10th ACM International Symposium on Computer Architecture, 1983, pp. 117–123.

Michel Dubois is a Professor in the Department of Electrical Engineering of the University of Southern California. Before joining USC in 1984, he was a Research Engineer at the Central Research Laboratory of Thomson-CSF in Orsay, France. His main interests are Computer Architecture and Parallel Processing, with a focus on Multiprocessor Architecture, Performance, and Algorithms. He has published more than 100 papers in technical journals and leading conferences on these topics. Dubois holds a PhD from Purdue University, an MS from the University of Minnesota, and an Engineering degree from the Faculté Polytechnique de Mons in Belgium, all in Electrical Engineering. He is a member of the ACM and a Fellow of the IEEE.

Jaeheon Jeong is a System Performance Engineer at IBM, Beaverton, OR. His research interests include Computer Architecture and Performance Evaluation with a focus on Memory Hierarchy and Shared-Memory Multiprocessors. He worked on several FPGA-based performance analysis tools including MemorIES. He received a BS in Electronics Engineering from Korea University, and an MS and a PhD in Electrical Engineering from the University of Southern California.

Ashwini Nanda is a Research Staff Member at IBM T.J. Watson Research Center, Yorktown Heights, NY, where he is currently a member of the Research Strategy Team. At IBM Research, he started the Scalable Server Architecture group and managed the group for several years. Dr. Nanda's research interests include Computer Architecture and Shared-Memory System Design and Performance with an emphasis on Commercial Applications. The research projects he led at IBM include the Memory Instrumentation and Emulation System (MemorIES), High Throughput Coherence Controllers, and the Watson Commercial Server Performance Lab. He is a member of the architecture and design team for IBM's next generation NUMA-Q machines. Prior to working at IBM, Dr. Nanda worked on the Amazon Superscalar Processor at Texas Instruments, Dallas, and on a Message Passing Multiprocessor System at Wipro, Bangalore, for India's missile research program. Dr. Nanda has been a Co-General Chair of the International Symposium on High Performance Computer Architecture (HPCA-7), serves on the Editorial Board of IEEE Transactions on Parallel and Distributed Systems, and is co-editing a special issue of the IEEE Computer magazine. He has several patents and has published over 30 papers on Computer Architecture and Performance.