
Studying Inter-Core Data Reuse in Multicores∗

Yuanrui Zhang, Mahmut Kandemir, and Taylan Yemliha
{yuazhang, kandemir}@cse.psu.edu, [email protected]

Pennsylvania State University & Syracuse University
University Park, PA 16802, USA

ABSTRACT
Most existing research on emerging multicore machines focuses on parallelism extraction and architectural-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most of the previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to the shared on-chip caches most of them accommodate. In order to optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition for inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core centric) code/data optimizations exploit the available inter-core data reuse in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that trying to optimize for inter-core reuse aggressively, without considering the impact of doing so on intra-core reuse, can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality in multicores.

∗ This work is supported in part by NSF grants #1017882, #0963839, CNS #0720645, CCF #0811687, and CCF #0702519, and a grant from Microsoft Corporation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMETRICS'11, June 7–11, 2011, San Jose, California, USA.
Copyright 2011 ACM 978-1-4503-0262-3/11/06 ...$10.00.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—compilers, optimization

General Terms
Design, Performance, Experimentation, Algorithms

1. INTRODUCTION
Emerging multicore architectures offer a single processor package that contains two or more cores to enable parallel execution of multithreaded applications. These architectures are currently replacing traditional single-core machines that employ complex micro-architectures clocked at very high frequencies. All major chip vendors today have their multicore products on the market [1, 2, 3, 4], and trends indicate that future multicores will have a large variety of on-chip configurations in terms of the number of cores, on-chip cache topologies, and interconnect structures [5].

Unfortunately, in this multicore era, application developers can no longer rely on increasing clock speeds alone to speed up their sequential applications. Instead, they must be able to design and implement their applications to execute in a multithreaded environment. Clearly, the first step along this direction is to parallelize an application, i.e., creating a multithreaded version of it. While proper parallelization is critical to achieve good performance in emerging multicore architectures, this alone may not be sufficient for many applications. To gain a competitive advantage, application developers must also be able to exploit the on-chip cache hierarchies of these new architectures. These hierarchies come in a variety of forms, but most include some sort of shared component which can be accessed by two or more cores. The existence of shared last-level on-chip caches can play a very important role in application behavior [29, 11, 12, 46]. This is because missing in the last-level cache of a multicore architecture results in an off-chip memory access, which can be very costly from both performance and power consumption perspectives.

It needs to be noted that the performance of a multicore cache hierarchy is a function of both application characteristics and the underlying cache topology. As a result, it is of utmost importance to understand application characteristics regarding cache behavior and associate them with the cache topology information. Unfortunately, current state-of-the-art code analysis and optimization techniques for cache locality may not be sufficient in this context since they were developed in the context of single-core machines, which did

not have the concept of shared caches. One of the significant impacts of shared caches is that they enable two cores/threads to share data using this cache space, which in turn brings up the concept of inter-core data reuse. This new reuse concept also brings with it the potential for a whole new set of optimization opportunities. Motivated by this observation, we make the following contributions in this work:

• We give a definition for inter-core data reuse and quantify it using a set of ten multithreaded application programs. Our results show that, while intra-core reuse distances are generally short, inter-core reuse distances tend to be much higher. Consequently, they may require/benefit from different optimization strategies than those available today. We further observe that both temporal and spatial inter-core reuses exhibit similar characteristics as far as reuse distances are concerned.

• We show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art code/data optimizations exploit the available inter-core data reuse in multithreaded applications. For example, while a three-layer on-chip cache hierarchy converts about 77.6% (51.8%) of the available intra-core temporal (spatial) reuses into locality (hits in the L2 or L3 levels), the same architecture is successful in converting only about 11.3% (4.1%) of the inter-core temporal (spatial) reuses into locality. Further, using a powerful set of data locality optimizations, we are able to convert only an additional 3.3% (3.2%) of the inter-core temporal (spatial) reuses into locality.

• We demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that trying to optimize for inter-core reuse aggressively without considering the impact of doing so on intra-core reuse can actually perform worse than optimizing for intra-core reuse alone.

• Motivated by the observation in the previous item, we then present a novel, compiler-based data locality optimization strategy for multicores that balances both inter-core and intra-core reuse optimizations. Specifically, our approach implements an integrated mapping and scheduling strategy that maximizes both "vertical" and "horizontal" data reuses considering the on-chip cache hierarchy of the target architecture. The key component of this strategy is an intelligent weight assignment scheme that considers potential data reuses among different computation blocks. Our results collected using ten applications on an Intel multicore machine indicate that this new approach brings about 23.1% and 23.7% improvements in L2 and L3 cache hits, respectively, resulting in an average performance (execution time) improvement of 18.8%. Further, we study in detail several parameters used in our optimization strategy, and show that the proposed strategy brings consistent benefits under different values of the major experimental parameters.

The rest of this paper is organized as follows. The next section discusses the on-chip cache behavior of multicores, and Section 3 explains the concepts of data reuse, reuse distances, and data locality, focusing in particular on multicore-specific aspects. Section 4 describes our target architectures and the application programs used for quantifying intra-core and inter-core data reuses. Section 5 quantifies inter-core and intra-core data reuses in our original applications.


Figure 1: Harpertown architecture.


Figure 2: Dunnington architecture.

Next, Section 6 evaluates conventional cache management and state-of-the-art data locality optimizations, Section 7 discusses the potential and cost of inter-core reuse optimization, and Section 8 presents our compiler-based data locality optimization strategy that carefully balances inter-core and intra-core data reuses. Section 9 discusses the data locality improvements brought by our approach. Finally, Section 10 and Section 11 give related work and concluding remarks, respectively.

2. MULTICORE CACHE BEHAVIOR
In a multicore machine, two or more cores are integrated into a single circuit die. One of the distinguishing characteristics of multicore machines is their adoption of on-chip caches. These caches provide fast access to frequently-used data and thus help to reduce execution latencies. Many commercial multicore machines in the market today (e.g., [1, 3, 5]) employ shared on-chip caches, where a number of cores can access the same on-chip cache component. This cache space sharing, however, can be constructive or destructive [14]. In the former case, cores share data that reside in the same cache line (block) in a very efficient manner using this shared space. In the latter case, two cores can displace each other's data from the shared cache space, hurting application performance. Clearly, whether an execution experiences constructive sharing or destructive interferences depends on what threads are run on the cores, what data they access, and what their data access/sharing patterns are.

Figures 1 and 2 show high-level views of two commercial multicore architectures. Each of these machines has two sockets (delimited by dashed boxes in the figures). One of these machines, Intel Harpertown [22], has a two-level on-chip cache hierarchy (L1 and L2), and the other one, Intel Dunnington [21], has three levels of on-chip caches (L1, L2 and L3). In each of these multicore machines, the performance of the last-level cache (L2 in Harpertown and L3 in Dunnington) is very important since a miss in this cache can be very costly (about 100 nsec in Harpertown and about 50 nsec in Dunnington). Consequently, one of the goals in mapping an application to a multicore architecture is to maximize the performance of the last-level cache.

One of the ways of improving cache performance, i.e., reducing destructive interferences while improving the chances for constructive sharing, is to reduce the distance (in terms of execution cycles) to shared data. This distance, called the reuse distance and discussed in detail in the next section, is a critical metric to quantify for both the data accessed by a core exclusively and the data accessed (shared) by multiple cores. Note that reducing reuse distances minimizes intervening data accesses, thereby lowering the chances for destructive

interferences (as there are fewer potential candidates that can displace the data to be reused from the shared cache space) and increasing the chances for constructive sharing (as we can catch the reused data in the shared cache space with a higher probability).

3. DATA REUSE, REUSE DISTANCES, AND DATA LOCALITY

In this paper, we define "temporal reuse" as the reuse of a previously accessed data element. On the other hand, "spatial reuse" can be defined as an access to a data element which falls within the same cache block boundary as a previously accessed data element. Note that, based on these definitions, temporal reuse can be considered a special case of spatial reuse. The important point to emphasize is that the existence of a reuse does not tell much about the performance of the associated data reference. This latter characteristic is captured through the concept of "locality". More specifically, if, at the time of the reuse, the reused item is caught in the cache (as opposed to missing in the cache and being accessed from the main memory), we say that the reuse has been converted into locality. Clearly, the success of an execution is directly related to

the amount of data reuse that can be converted into locality. This conversion in turn is a function of both architectural parameters, such as cache capacities, line (block) sizes, and associativities, as well as a program characteristic called the "reuse distance". Specifically, the shorter the reuse distance, the higher the chances that the associated reuse will be exploited in the cache space. Therefore, one of the main goals of many previously-proposed code and data optimizations can be summarized as reducing the reuse distances. As stated above, in a multicore architecture, on-chip caches

can be shared across different cores, and consequently, one can think of two types of data reuses: intra-core reuse and inter-core reuse.¹ As illustrated in Figure 3, an intra-core reuse takes place if two successive accesses to a data element/block are from the same core. In contrast, an inter-core reuse is said to occur if two successive accesses to a data element/block are from different cores. It needs to be noted that, while the concept of intra-core reuse is common to both single-core and multicore machines, the concept of inter-core reuse is unique to multicore machines. Irrespective of whether we talk about intra-core or inter-core reuse, however, reducing the reuse distances would be beneficial as far as cache performance is concerned, because of the reasons explained earlier in Section 2.
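To make these definitions concrete, the following sketch (our own illustrative model, not the instrumentation used in the paper) classifies the reuses in a trace of (cycle, core, cache block) accesses as intra-core or inter-core and records their reuse distances in cycles; the trace format and the function name are hypothetical.

# Classify block-level reuses as intra-core or inter-core and record their
# reuse distances (in cycles). Trace entries are (cycle, core, block) tuples.
def classify_reuses(trace):
    last_access = {}   # block -> (cycle, core) of its previous access
    reuses = []        # list of (kind, reuse_distance) pairs
    for cycle, core, block in trace:
        if block in last_access:
            prev_cycle, prev_core = last_access[block]
            kind = "intra-core" if core == prev_core else "inter-core"
            reuses.append((kind, cycle - prev_cycle))
        last_access[block] = (cycle, core)
    return reuses

# Example: core 0 touches block 5 twice, then core 1 touches the same block.
trace = [(100, 0, 5), (140, 0, 5), (300, 1, 5)]
print(classify_reuses(trace))   # [('intra-core', 40), ('inter-core', 160)]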

4. TARGET ARCHITECTURES AND APPLICATION PROGRAMS

Most of our experiments in this paper have been conducted using the Intel Dunnington architecture, which is shown in Figure 2. We also report a set of results from another Intel architecture (Harpertown, shown in Figure 1) to demonstrate that our proposed scheme works well with different architectures. Table 1 gives the important characteristics of these two commercial architectures. The application programs used in this work are from the SPECOMP

¹ Since in our analysis in this paper we assume one thread per core, we can use the terms "intra-thread reuse" and "inter-thread reuse" in place of "intra-core reuse" and "inter-core reuse", respectively.


Figure 3: An example of inter-core and intra-core data reuses with respect to two array elements. There is an intra-core reuse of A[1] between instructions x and z executed on core p0, and an inter-core reuse of A[10] between instructions y and k, which are executed on cores p0 and p1, respectively.

Parameter | Harpertown | Dunnington
Number of Cores | 8 cores (2 sockets) | 12 cores (2 sockets)
Clock Frequency | 3.20 GHz | 2.40 GHz
L1 Cache | 32KB, 8-way, 64-byte line size, 3-cycle latency | 32KB, 8-way, 64-byte line size, 4-cycle latency
L2 Cache | 6MB, 24-way, 64-byte line size, 15-cycle latency | 3MB, 12-way, 64-byte line size, 10-cycle latency
L3 Cache | - | 12MB, 16-way, 64-byte line size, 40-cycle latency
Off-Chip Latency | ∼100 ns | ∼50 ns

Table 1: Important parameters for our two Intel multicore machines.

benchmark suite [8]. We used all the benchmarks in this suite except wupwise, which we could not compile on our multicore machines using the native compilers. The cache hit/miss statistics of these programs, along with the total amount of data they manipulate, are given in Table 2. All the reuse distance results presented in this work are collected using the Simics-GEMS framework [34], which provides accurate timing models for multicore simulation. Specifically, using this platform, we simulated the configuration of the Dunnington machine. The cache hit/miss statistics and execution time results, on the other hand, are collected using real executions on the commercial machines (Dunnington and Harpertown).

5. QUANTIFYING INTRA-CORE AND INTER-CORE DATA REUSES

Our first set of results quantifies the intra-core and inter-core reuses as well as the temporal and spatial reuses in our applications; they are presented in Figure 4. One can make two important observations from these results. First, most of the data reuse is intra-core, which can be attributed to the fact that most of these applications have been parallelized such that inter-core data sharing is minimized to the maximum extent possible. An exception to this rule is fma3d, where inter-core reuse dominates. Our second observation is that, with the exception of equake, spatial reuse dominates the spectrum in both intra-core and inter-core reuses. This is particularly true for inter-core reuses, as spatial reuses account for 82.2% of all inter-core reuses, whereas the corresponding figure in the case of intra-core reuses is about

Benchmark | Description | Data Set Size (MB) | Harpertown Miss Rates [%] (L1 / L2) | Dunnington Miss Rates [%] (L1 / L2 / L3)
gafort | Genetic algorithm | 28.6 | 3.9 / 37.4 | 2.9 / 41.0 / 31.1
swim | Shallow water modeling | 20.5 | 3.1 / 27.7 | 4.4 / 29.0 / 23.6
mgrid | Multigrid solver in 3D potential field | 18.6 | 6.6 / 49.7 | 7.2 / 44.2 / 39.6
applu | Parabolic/elliptic partial differential equations | 24.3 | 5.1 / 52.8 | 5.3 / 55.3 / 43.6
galgel | Fluid dynamics: analysis of oscillatory instability | 61.2 | 1.9 / 36.3 | 2.4 / 39.7 / 33.5
equake | Finite element simulation; earthquake modeling | 47.4 | 8.1 / 32.9 | 7.7 / 38.2 / 29.3
apsi | Temperature, wind, velocity and distribution of pollutants | 29.9 | 8.7 / 41.5 | 3.9 / 46.8 / 35.7
fma3d | Finite element crash simulation | 18.2 | 4.1 / 56.3 | 6.8 / 40.1 / 41.1
art | Neural network simulation; adaptive resonance theory | 26.1 | 6.7 / 38.9 | 7.3 / 47.7 / 39.4
ammp | Computational chemistry | 36.1 | 4.1 / 48.3 | 8.7 / 46.0 / 46.3

Table 2: Benchmarks used in our study.

[Stacked bars per application showing the breakdown of data reuses; legend: Intra-Core (Temporal), Intra-Core (Spatial), Inter-Core (Temporal), Inter-Core (Spatial).]

Figure 4: The breakdown of data reuses in our ten applications.

[Histogram; x-axis: Log2(Reuse Distance), y-axis: Fraction of Data Reuses; series: Temporal and Spatial.]

Figure 5: The distribution of intra-core reuse distances.

[Histogram; x-axis: Log2(Reuse Distance), y-axis: Fraction of Data Reuses; series: Temporal and Spatial.]

Figure 6: The distribution of inter-core reuse distances.

67.3%. This result means that different cores mostly share cache blocks (lines) rather than individual data elements. While the breakdown of data reuses into intra-core and

inter-core components is clearly important, as discussed earlier, one of the main factors which plays an important role in determining whether a reuse will be converted into locality or not is the reuse distance. Figures 5 and 6 plot the distribution of reuse distances in our application programs for intra-core reuses and inter-core reuses separately. Note that each graph represents average statistics over all ten application programs. The x-axis represents the reuse distance in log2 (of execution cycles) between references to the same data block, and the y-axis gives the fraction of the reuses. Maybe the most important observation from these results is that, while intra-core reuse distances are generally short, inter-core reuse distances are much higher. This result, which holds true in the case of both temporal and spatial reuses, is important and indicates that, when two different cores reuse the same data element or block, they do not reuse it within a short period of time. For example, about 85% and 98% of the temporal and spatial inter-core reuses, respectively, have a reuse distance of 1024 cycles or more, which is much higher than the corresponding percentage values (16% and 21%) in the case of intra-core data reuse. These differences in reuse distance distributions across the

intra-core and inter-core reuses can also be interpreted as an indication that we probably need different strategies for optimizing (taking advantage of) intra-core and inter-core reuses. For example, while a small increase in cache capacity can help us convert most of the intra-core reuses into locality, doing the same with inter-core reuses would require a much larger increase in cache capacity. Before we move to our detailed locality analysis, let us

first discuss whether these intra-core and inter-core reuses belong to self or group reuses. In the context of this paper,

if the two instructions involved in a reuse are the same, the resulting reuse is termed a "self-reuse", i.e., two different instances of the same static instruction touch the same data element (temporal) or cache block (spatial); if they are different, it is called a "group-reuse". The results presented in Figure 7 indicate that most of the intra-core reuses are actually self-reuses. It needs to be noted that all inter-core reuses are (by definition) group-reuses.
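The self/group distinction can be expressed with a similar sketch (again illustrative only), which checks whether the two accesses of a reuse come from the same static instruction (PC); the trace format is hypothetical.

# Count self- and group-reuses: a reuse is a self-reuse when the previous
# access to the same block came from the same static instruction (PC).
def classify_self_group(trace):      # trace entries: (pc, block)
    last_pc = {}                     # block -> PC of the previous access
    counts = {"self": 0, "group": 0}
    for pc, block in trace:
        if block in last_pc:
            counts["self" if pc == last_pc[block] else "group"] += 1
        last_pc[block] = pc
    return counts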

6. EVALUATION OF MULTI-LAYER CACHE HIERARCHY AND STATE-OF-THE-ART DATA LOCALITY OPTIMIZATIONS

Our goal in this section is two-fold. First, we want to measure the capability of the on-chip cache architecture in Dunnington in converting intra-core and inter-core data reuses into locality (hits in the L2 and L3 caches). Second, we want to evaluate a set of well-known data locality optimization strategies originally developed in the context of single-core machines with cache hierarchies, and quantify their success in converting intra-core and inter-core reuses into locality. Table 3 lists the set of optimizations considered in this work. The second column of this table gives a brief description of each optimization, whereas the last column indicates the specific implementation employed in selecting the values of the important parameters used in the corresponding optimization. For example, in tiling, we used the strategy explained in [18] to select the tile sizes (also called blocking factors).

The y-axis in Figure 8 shows the fraction of data reuse that has been converted into locality (i.e., the fraction of the reused data elements/blocks that are caught in the cache at the time of their reuses). Note that we present these results for the L2 and L3 caches separately. In this graph, the first group of bars (marked as "original") gives the results when the original codes are used. The remaining groups correspond to the results obtained when using the different data locality

Optimization | Brief Description | Reference
Linear | A general optimization strategy that represents a loop transformation using a linear matrix. The optimizations included are loop permutation, loop skewing and loop reversal. | [49]
Loop Scaling | Linear transformations augmented with loop scaling. | [32]
Tiling | A restructuring strategy that partitions a loop's iteration space into smaller chunks or blocks, so as to help ensure that data used in the loop remain in the cache until the reuse takes place. | [18]
Data Layout | Changing the memory layout of data (e.g., converting from row-major to column-major) to maximize spatial reuse. | [26]
Combined | This option implements all of the optimizations listed above. |

Table 3: The set of single-core centric data locality optimizations we evaluated on our multicore machine.


Figure 7: The breakdown of (intra-core) self and group reuses in our ten applications.

[Grouped bars for the Original, Linear, Scaling, Tiling, Data-Layout and Combined versions; each group reports Intra-Core (Temporal), Intra-Core (Spatial), Inter-Core (Temporal) and Inter-Core (Spatial) locality, separately for L2 and L3.]

Figure 8: The fraction of data reuse converted into locality in the L2 and L3 caches.

[Bars per application; series: Maximum and Inter-Core Only; y-axis: Improvement in Execution Cycles.]

Figure 9: Ideal and practical application performance improvement by exploiting inter-core reuses.

optimizations listed in Table 3. One can see from these results that, while these optimizations are effective in converting a majority of the intra-core data reuse into locality, the same cannot be said for the inter-core data reuse. In fact, when considering all inter-core spatial reuses (i.e., all reuse distances), only 4.1% and 3.2% of them are converted into L2 and L3 hits, respectively, even when activating all these locality optimizations together. These numbers are much lower compared to 62.2% and 38% in the intra-core spatial reuse case when the same set of locality optimizations is used. Based on these results, we can conclude that the on-chip cache hierarchy of the Intel Dunnington machine (see Figure 2) fails to take advantage of inter-core data reuses (due to the high reuse distances experienced by data accesses). Further, even a state-of-the-art locality optimization suite is not successful in converting inter-core reuses into cache hits in L2 or L3. These observations certainly call for novel strategies to take advantage of inter-core reuses. However, at this point, there are two issues that need to be addressed. First, it is not clear how much benefit optimizing for inter-core reuse would bring. Second, and maybe more importantly, it is not clear how one can optimize for inter-core reuse without distorting intra-core reuse. Both these questions are addressed in the rest of this paper.

7. POTENTIAL AND COST OF INTER-CORE REUSE OPTIMIZATION

We start our discussion with the potential impact of improving inter-core data reuse. The first bar for each application in Figure 9 gives the performance improvement that can be achieved using a hypothetical scheme which removes all inter-core misses (both temporal and spatial) without affecting any hit coming from exploiting intra-core reuse. In these experiments, our baseline is a locality-optimized version of the applications (using the version called "Combined", which is explained in the previous section). We see from

these results that the maximum savings that could be obtained from exploiting inter-core reuse without harming any intra-core reuse is about 21.3% on average.

However, it needs to be noted that these results represent the maximum possible savings, which may not be achieved in reality by a practical scheme. To see the difference between the two, we next present results from a scheme that tries to exploit inter-core data reuse without considering the impact of doing so on intra-core reuse. The basic idea behind this scheme is to ensure that inter-core reuses are exploited in the shared cache space even if doing so results in extra misses due to not exploiting intra-core data reuse fully. Specifically, this scheme first runs an instrumented version of the application and identifies data elements shared by different subsets of cores. Next, each load instruction in the code is tagged to indicate whether it accesses mostly core-private data or data shared across cores. More specifically, we tag a load instruction as "shared" if at least 70% (a tunable parameter)²

of the references it makes are to shared data. After that, the application is executed again. This time, however, any data brought by a shared load into any of the shared caches in the system is prevented from being displaced by the access made by a non-shared (normal) load (this is implemented by disabling the cache access by the normal loads). In a sense, under this scheme, the shared data elements are prioritized over the private ones as far as shared caches are concerned. Our goal in performing experiments with this scheme is to measure the potential impact (in terms of both benefits and costs) of exploiting inter-core reuse aggressively.
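The tagging step can be sketched roughly as follows, assuming an offline profile that records, for each static load, how many of its dynamic references touch data accessed by more than one core; the profile format and names are ours, and 0.70 corresponds to the tunable threshold mentioned above.

# Tag a static load as "shared" if at least 70% of its dynamic references
# are to data shared across cores (profile-driven, offline).
def tag_loads(profile, threshold=0.70):
    # profile: load_pc -> (refs_to_shared_data, total_refs)
    tags = {}
    for pc, (shared_refs, total_refs) in profile.items():
        frac = shared_refs / total_refs if total_refs else 0.0
        tags[pc] = "shared" if frac >= threshold else "normal"
    return tags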

The second bar for each application in Figure 9 gives the execution time improvement under this strategy. Our observation is that this approach, which optimizes for inter-core data reuse aggressively, does not perform very well, resulting in an average execution time improvement of 4.4%. More importantly, this approach degrades the performance of four

² We also performed experiments with other ratios and found that 70% generates better results than the others.

applications (mgrid, applu, galgel, and apsi) as compared to the baseline version. Based on these results, we can conclude that an aggressive inter-core reuse centric strategy may not be the best option as far as overall application performance is concerned. Motivated by this, in the next section, we discuss a novel data locality optimization strategy that balances inter-core and intra-core reuse optimizations in an attempt to maximize shared cache performance and minimize application execution time.

8. BALANCING INTER-CORE AND INTRA-CORE DATA REUSES

In this section, we present and evaluate a compiler-directed strategy that considers both inter-core and intra-core data reuses in a balanced fashion. We used the SUIF compiler [48] from Stanford University (as a source-to-source translator) to implement this compiler-based strategy. The operation of our compiler-based strategy can be summarized as follows. First, the arrays accessed by the application are divided into equal-sized "data blocks", and each block is given a unique id. Second, for each core, the set of loop iterations assigned to it is divided into equal-sized "computation blocks". In this work, the computation block is the unit of scheduling, i.e., we assign computation blocks to cores and schedule (in each core) one computation block at a time. Note that computation blocks can come from different loop nests, and we capture data dependences and data sharing between them using two graph structures maintained by our compiler (these structures will be explained later in the paper). The key component of our strategy is the scheme we adopt to schedule computation blocks in both time (schedule slots) and space (cores). To select the computation block to be scheduled in a given slot and core, we consider the data reuse between each of the potential candidates and a subset (explained below) of already scheduled computation blocks.
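The structures this description relies on can be summarized as in the sketch below (an illustrative rendering; the class and field names are ours, not SUIF's).

# A computation block is a chunk of loop iterations assigned to a core; it
# records the ids of the data blocks it accesses (the set Delta used below).
class ComputationBlock:
    def __init__(self, block_id, data_blocks):
        self.id = block_id
        self.data = set(data_blocks)

# The Computation Block Dependence Graph (CBDG) maps each block id to the
# ids of the blocks it depends on; a block becomes schedulable only after
# all of its predecessors have been scheduled (toy example below).
cbdg = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}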

8.1 Integrated Mapping and Scheduling
Our goal is to fill in the entries of a "scheduling table", the

high-level structure of which is shown in Figure 10. An entry (x, y) in this scheduling table contains (when the scheduling is done) the id of the computation block scheduled at core x in slot y. One can think of two types of data reuses regarding this scheduling table. First, for a given core, the computation blocks scheduled at successive slots can have data reuse between them. Second, for a given slot, the computation blocks scheduled at different cores can have data reuse among them. Clearly, to maximize application performance, data reuses in both these directions should be maximized to the greatest extent allowed by data dependences. In the rest of our discussion, we refer to these two types of reuse as "vertical reuse" and "horizontal reuse". It is important to note that exploiting vertical reuse helps to improve cache behavior at all levels of an on-chip cache hierarchy. In comparison, optimizing for horizontal reuse helps to improve the performance of shared caches. This is because, when the computation blocks scheduled at different cores in the same slot have high data reuse, the data brought by one of the cores to a shared cache will most probably be reused by one or more other cores while it is still in the cache (due to the short reuse distance). Let Λi,t be the computation block scheduled at core i in

slot t, and ∆i,t be the set of data blocks accessed by Λi,t.


Figure 10: Scheduling table for computation blocks.

Based on these definitions, we can define our constraints for vertical and horizontal reuses as follows:

• Vertical Constraint: ∀i : maximize |∆i,t ∩ ∆i,t−1|,³ and

• Horizontal Constraint: ∀j : maximize |∆i,t ∩ ∆j,t|, where j ≠ i.

Considering the scheduling table, these constraints can be captured using a single expression:

Maximize { Σ_{0 ≤ j ≤ p−1} Σ_{t−K ≤ tk < t} πj,tk |∆i,t ∩ ∆j,tk| + Σ_{0 ≤ j ≤ i−1} πj,t |∆i,t ∩ ∆j,t| },

where the π values are coefficients that represent the weights of the different reuses, i.e., how important a given reuse is with respect to Λi,t. Note that this expression is for core i and schedule slot t. That is, we want to select the computation block to schedule at core i in slot t, and we want to make this selection such that vertical and horizontal reuses are maximized. There are several important observations regarding this expression. First, when we evaluate this expression, all the scheduling decisions for all cores in steps 1 through t−1 are assumed to have already been made. Further, the scheduling decisions for cores 0 to i−1 at step t are also assumed to have been made.⁴ As an example, Figure 11 illustrates the situation when we are about to schedule a computation block at core 7 in step 7. Second, the first part (the left operand of the addition operator) is there to capture reuses with respect to the schedule steps before t, whereas the second part is there to capture reuses with respect to step t. Third, K represents how far back we should go in considering the reuse (see Figure 11), that is, what is the maximum distance between two slots across which a data reuse can be considered. It is important to emphasize that we evaluate all possible candidates for a given schedule slot using the expression above, and select the one which maximizes its value. As an example, if K = 1, for Λ7,7, the expression to maximize is: π0,6 |∆7,7 ∩ ∆0,6| + π1,6 |∆7,7 ∩ ∆1,6| + π2,6 |∆7,7 ∩ ∆2,6| + π3,6 |∆7,7 ∩ ∆3,6| + π4,6 |∆7,7 ∩ ∆4,6| + π5,6 |∆7,7 ∩ ∆5,6| + π6,6 |∆7,7 ∩ ∆6,6| + π7,6 |∆7,7 ∩ ∆7,6| + π8,6 |∆7,7 ∩ ∆8,6| + π9,6 |∆7,7 ∩ ∆9,6| + π10,6 |∆7,7 ∩ ∆10,6| + π11,6 |∆7,7 ∩ ∆11,6| + π0,7 |∆7,7 ∩ ∆0,7| + π1,7 |∆7,7 ∩ ∆1,7| + π2,7 |∆7,7 ∩ ∆2,7| + π3,7 |∆7,7 ∩ ∆3,7| + π4,7 |∆7,7 ∩ ∆4,7| + π5,7 |∆7,7 ∩ ∆5,7| + π6,7 |∆7,7 ∩ ∆6,7|.
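In code form, this candidate-selection step can be sketched as follows (our own illustrative rendering, not the paper's implementation); the weights are taken as given here, and how they are derived from the cache topology is described below.

# Score a candidate block for core i in slot t: sum the weighted data-block
# overlaps with blocks scheduled in the previous K slots (all cores) and with
# blocks already placed at cores 0..i-1 in slot t, then pick the best one.
def score(candidate, schedule, data, weights, i, t, K, num_cores):
    total = 0.0
    for j in range(num_cores):
        for tk in range(max(1, t - K), t + 1):
            if tk == t and j >= i:          # only earlier cores in slot t
                continue
            placed = schedule.get((j, tk))  # block id or None
            if placed is not None:
                total += weights[(j, tk)] * len(data[candidate] & data[placed])
    return total

def pick_block(candidates, schedule, data, weights, i, t, K, num_cores):
    return max(candidates,
               key=lambda c: score(c, schedule, data, weights, i, t, K, num_cores))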

There are two parameters one can study in the expression above: K and π. Clearly, a higher value of K indicates that we are considering reuses that span a larger distance (in time). In contrast, a smaller value means we only care

³ | · | denotes set cardinality and represents the number of common data blocks accessed by two computation blocks.
⁴ The cores are ordered starting from 0 given the target multicore architecture.


Figure 11: The computation blocks to be considered (indicated by the bounded rectangles) when scheduling computation block Λ7,7.

for data reuses in close proximity in time. On the other hand, the weights (π values) indicate (for a given value of K) how important the different reuses are. One can adopt different strategies in assigning values to these weights; however, in principle, the closer reuses (in terms of both time and relative core locations) should have higher weights. In our work, in assigning these weights, we take into account the underlying on-chip cache topology of the target multicore architecture.

More specifically, we represent the on-chip cache hierarchy of the target multicore architecture using a tree structure, where the root denotes a shared last-level cache, the internal nodes are the caches that are connected to this last-level cache, and the leaves correspond to the cores. If there is more than one last-level cache, we have a disjoint union of trees, or equivalently, a forest. To precisely describe each core's location in the on-chip cache hierarchy, we assign ids to the caches at each level, and associate their corresponding internal nodes in the tree/forest structure with these ids, as illustrated in the example given in Figure 12, where there is only one shared last-level cache with id 0 (L3 0). Based on this, we employ a tuple of the form <N1, N2, · · ·, NM−1, NM> to denote each path from the root to a leaf in an M-level tree, where Ni is the id of a node (cache or core) on that path. If we ignore the core id NM, the partial path <N1, N2, · · ·, NM−1> provides us with a core's cache utilization information, i.e., which cache(s) a core accesses. For instance, in Figure 12, the partial paths for cores 0 and 1 are both <0, 0, 0>, whereas the partial paths for cores 2 and 3 are <0, 0, 1>. Note that, by comparing the partial paths of two cores, we can obtain the cache sharing information, i.e., the number of caches two cores both connect to. We define this as the Core Sharing Degree (CSD), and the following equation gives its mathematical formulation, where Px represents the set of cache ids on the partial path for core x:

CSD(i, j) = | Pi ∩ Pj | .

Continuing with Figure 12, it can be observed that the CSD of cores 0 and 1 is 3, whereas CSD(0, 2) equals 2. Note that, if two cores reside in two different last-level cache hierarchies (or trees), their CSD will be 0. Note further that CSD gives us the weight information for

the horizontal reuse between two computation blocks scheduled on different cores at any given slot. To capture the weight for the vertical reuse between two computation blocks scheduled in successive slots, or in two different slots within a distance of K, we define the Time (Slots) Sharing Degree (TSD) as

TSD(t, tk) = 1− (t− tk)/(K + 1),

where t is the time slot under schedule, and tk is a time


(a) An example multicore architecture

(b) The corresponding architecture tree

Figure 12: An example multicore architecture and its corresponding architecture tree.


Figure 13: The computation blocks to be considered (indicated by the bounded rectangles) when scheduling computation block Λ7,7, assuming the Harpertown architecture.

slot between t and t − K. Supposing K = 3, according to this definition, TSD(t, t) is 1, TSD(t, t−1) is 0.75, TSD(t, t−2) is 0.5, and TSD(t, t−3) is 0.25. TSD reflects the fact that the closer two computation blocks with data reuse are scheduled, the higher the chance that their data reuse can be converted into data locality in the cache, thanks to the shorter reuse distance. For any time slot tk that is beyond the distance K from t, TSD(t, tk) equals 0, which means we do not consider the reuse between two computation blocks if the distance between their schedule slots is larger than K.

Based on these definitions, we now determine the weights (π values) according to the underlying on-chip cache topology for computation block Λi,t at core i in slot t. Basically, we couple TSD and CSD to form the weights. In particular, the weight πj,tk for the term |∆i,t ∩ ∆j,tk| in the target expression to be optimized is calculated as:

TSD(t, tk)× CSD(i, j).
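A small sketch of this weight computation, using the partial paths of Figure 12 as an example, is given below; the representation is ours (cache ids are qualified by their level so that, e.g., L3 0 and L2 0 do not collide), and it simply reproduces the TSD × CSD rule.

# Partial paths <L3 id, L2 id, L1 id> for the twelve cores of Figure 12.
paths = {
    0: {("L3", 0), ("L2", 0), ("L1", 0)},  1: {("L3", 0), ("L2", 0), ("L1", 0)},
    2: {("L3", 0), ("L2", 0), ("L1", 1)},  3: {("L3", 0), ("L2", 0), ("L1", 1)},
    4: {("L3", 0), ("L2", 1), ("L1", 2)},  5: {("L3", 0), ("L2", 1), ("L1", 2)},
    6: {("L3", 0), ("L2", 1), ("L1", 3)},  7: {("L3", 0), ("L2", 1), ("L1", 3)},
    8: {("L3", 0), ("L2", 2), ("L1", 4)},  9: {("L3", 0), ("L2", 2), ("L1", 4)},
    10: {("L3", 0), ("L2", 2), ("L1", 5)}, 11: {("L3", 0), ("L2", 2), ("L1", 5)},
}

def csd(i, j):                 # number of caches shared by cores i and j
    return len(paths[i] & paths[j])

def tsd(t, tk, K):             # decays linearly with the slot distance
    return 1.0 - (t - tk) / (K + 1) if 0 <= t - tk <= K else 0.0

def weight(i, j, t, tk, K):
    return tsd(t, tk, K) * csd(i, j)

print(weight(7, 6, 7, 6, 3))   # 0.75 * 3 = 2.25, matching Table 4 (t6, p6)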

Table 4 shows the weight assignment for computation block Λ7,7 in Figure 11 using this strategy, assuming that the twelve cores are connected to a shared cache as depicted in Figure 12. Note that, with respect to the target computation block Λ7,7, the computation blocks scheduled on cores 6 and 7 have higher data reuse weights than the others, since their data will go through exactly the same set of caches at the different levels. On the other hand, the computation blocks scheduled in slots t6 and t7 have higher data reuse weights than the others, since the data touched by those recent computation blocks have a higher chance of being found in the cache. In general, the weight values decrease as cores have less sharing with the target core i or as time (schedule) slots get further away from the target slot t. In addition, from

Weights | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | p10 | p11
t4 | 0.25 | 0.25 | 0.25 | 0.25 | 0.5 | 0.5 | 0.75 | 0.75 | 0.25 | 0.25 | 0.25 | 0.25
t5 | 0.5 | 0.5 | 0.5 | 0.5 | 1 | 1 | 1.5 | 1.5 | 0.5 | 0.5 | 0.5 | 0.5
t6 | 0.75 | 0.75 | 0.75 | 0.75 | 1.5 | 1.5 | 2.25 | 2.25 | 0.75 | 0.75 | 0.75 | 0.75
t7 | 1 | 1 | 1 | 1 | 2 | 2 | 3 | N/A | N/A | N/A | N/A | N/A

Table 4: The weight assignments for the computation blocks under consideration when scheduling computation block Λ7,7 in Figure 11 (K = 3).

Weights | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7
t4 | 0 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.5
t5 | 0 | 0 | 0 | 0 | 0.5 | 0.5 | 0.5 | 1
t6 | 0 | 0 | 0 | 0 | 0.75 | 0.75 | 0.75 | 1.5
t7 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | N/A

Table 5: The weight assignments for the computation blocks under consideration when scheduling computation block Λ7,7 in Figure 13 (K = 3).

Input: Number of cores, N; set of all computation blocks C to be scheduled; computation block dependence graph, CBDG(V, E); distance, K; id set of the partial path for each core x, Px.
Output: Computation blocks scheduled on each core.

1:  t = 1;
2:  while C != Empty do
3:    St = FindSchedulableBlocks(t);
4:    while St != Empty do
5:      for i = 1 to i = N do
6:        if St == Empty then
7:          Λi,t = −1;  // indicate synchronization
8:          continue;
9:        end if
10:       A = 0;  // record max data reuse between two computation blocks
11:       B = 0;  // record the computation block selected
12:       for each computation block s in St do
13:         M = 0;  // record data reuse
14:         if t − K < 0 then
15:           t′ = 0;
16:         else
17:           t′ = t − K;
18:         end if
19:         for j = 0 to j = N − 1 do
20:           for tk = t′ to tk = t do
21:             if tk == t and j >= i then
22:               continue;
23:             end if
24:             πj,tk = TSD(t, tk) × CSD(i, j);
25:             M += πj,tk × |∆s ∩ ∆j,tk|;
26:           end for
27:         end for
28:         if M > A then
29:           A = M;
30:           B = s;
31:         end if
32:       end for
33:       Λi,t = B;
34:       St = St − B;
35:     end for
36:     t++;
37:   end while
38:   C = C − St;
39: end while

Figure 14: Pseudo-code of our mapping and scheduling algorithm.

Table 4, we can see that intra-core reuses have high weights (e.g., the weights on core 7), which is reasonable, since they are usually very short. As another example, Table 5 lists the weight assignments for the computation blocks under consideration when scheduling computation block Λ7,7 in Figure 13, assuming that the Harpertown architecture depicted in Figure 1 is used.

Next, we discuss the selection of the parameter K. To choose a suitable value for K, we first construct an auxiliary data structure called the Sharing Graph. In this graph, each node corresponds to a computation block, and there is an edge between two nodes if the corresponding computation blocks share data. After that, we set the value of K to the average connectivity degree in the graph, i.e., the average number of edges per node. Clearly, this is a heuristic strategy. The motivation behind it is that there exists a possible situation where all the neighboring nodes of a particular node (computation block) are scheduled at successive schedule slots on the same core right before this node. Note that one could also set K to the highest connectivity degree in this graph; however, a larger K value introduces more calculations and results in a longer running time for our algorithm.
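A minimal sketch of this heuristic (our formulation; the rounding to an integer and the lower bound of 1 are our additions):

# Set K to the average connectivity degree of the Sharing Graph, whose nodes
# are computation blocks and whose edges connect blocks that share data.
def choose_k(sharing_edges, num_blocks):
    degree = [0] * num_blocks
    for u, v in sharing_edges:
        degree[u] += 1
        degree[v] += 1
    return max(1, round(sum(degree) / num_blocks))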

8.2 Our Algorithm and Implementation
Figure 14 gives a sketch of our algorithm that exploits

both vertical and horizontal reuses to the maximum extent allowed by data dependences. With dependences, the set of computation blocks that can be scheduled at a given time on a given core is restricted. To capture dependences among computation blocks, we construct a Computation Block Dependence Graph, CBDG(V, E), where each node in the graph denotes a computation block and each edge represents a data dependence between two computation blocks. To determine Λi,t, we obtain the set of computation blocks that can be scheduled at slot t by calling a function named FindSchedulableBlocks(t). For each of these blocks, we then calculate its total data reuse with neighboring computation blocks in both the vertical and horizontal directions within the region delimited by the K parameter (see lines 19-26 in Figure 14), and select the one that maximizes this value. If no computation block can be scheduled at a given slot due to data dependences, we put −1 into the corresponding entry of the scheduling table, to indicate that a synchronization across all the cores is needed at that point.
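A simplified view of FindSchedulableBlocks is sketched below (illustrative only; it treats a block as schedulable once every one of its CBDG predecessors has been scheduled).

# Return the not-yet-scheduled blocks whose CBDG predecessors have all been
# scheduled; these are the candidates for the current schedule step.
def find_schedulable_blocks(cbdg, scheduled, remaining):
    return [b for b in remaining
            if all(pred in scheduled for pred in cbdg.get(b, ()))]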

We want to emphasize that, although different computation blocks may have different execution times, the global synchronizations across the cores during application execution (required by data dependences) keep the overall execution close to the order captured in the scheduling table, even if the blocks with reuse are not processed in exactly the scheduled slots. In other words, one can expect significant benefits from our strategy even though the cores do not execute in a lock-step fashion.

Our strategy for selecting the computation block size can


Figure 15: Sharing graph for two candidate computation blocks A and B.

be summarized as follows. We start by assuming that the computation block size is S, an unknown parameter. After that, considering all references in the innermost loop position, we find an analytical expression that gives the total amount of data accessed by all the references (denoted by T). Then, we determine the value of S such that the resulting T is smaller than the L1 cache capacity of the target architecture. We found that, as long as this computation block size selection strategy is used, the data block size does not matter too much, since the data accessed by the inner loops will be captured in the cache. Note also that both the computation block size parameter and the K parameter are inputs to our algorithm given in Figure 14.
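The block-size selection can be sketched as below; the doubling search and the example footprint function are our own simplifications of the analytical procedure described above.

# Grow the computation block size S while the estimated data footprint T(S)
# of the innermost-loop references still fits in the L1 cache.
def select_block_size(footprint, l1_capacity, start=64):
    # footprint(S): analytical estimate (in bytes) of the data touched by a
    # block of S iterations, derived from the innermost-loop references.
    S = start
    while footprint(2 * S) <= l1_capacity:
        S *= 2
    return S

# Example: if each iteration touches roughly 8 new bytes and L1 is 32KB,
# the selected block size settles at 4K iterations.
print(select_block_size(lambda s: 8 * s, 32 * 1024))   # 4096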

8.3 Example
We now go over an example to illustrate the scheduling

step of our algorithm. Let us consider the target schedule slot 7 at core 7 in Figure 13 on the Harpertown architecture. Assume that, at this point, by checking the dependence graph, we find two candidate computation blocks, A and B, which can be scheduled in this slot. The data sharing between these two computation blocks and the other computation blocks is depicted in Figure 15 (the computation blocks that are not shown in the figure have no data sharing with computation blocks A and B). The weight of each edge denotes the number of data blocks shared by the connected nodes. Based on the π value assignments listed in Table 5 with K = 3, the total data reuse (including both vertical and horizontal) between computation block A and the other computation blocks within its range K can be calculated as 1.5 × 30 + 1 × 40 + 0.5 × 10 + 0 × 2 = 90, whereas the total data reuse for computation block B is 1.5 × 35 + 1 × 55 + 0.25 × 8 + 0 × 1 = 109.5. Since the latter has higher data reuse, we select computation block B as Λ7,7 to fill in the target slot, and proceed to the next slot.
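The arithmetic of this example can be reproduced directly (weights from Table 5, edge labels from Figure 15):

# Reuse scores for candidates A and B when scheduling slot (core 7, t7).
score_A = 1.5 * 30 + 1.0 * 40 + 0.5 * 10 + 0.0 * 2    # = 90.0
score_B = 1.5 * 35 + 1.0 * 55 + 0.25 * 8 + 0.0 * 1    # = 109.5
print(max([("A", score_A), ("B", score_B)], key=lambda x: x[1]))   # ('B', 109.5)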

9. EXPERIMENTAL EVALUATION
In this section, we present an experimental evaluation of

the proposed data locality optimization algorithm that balances inter-core and intra-core data reuses. Most of our results are collected using the commercial Intel Dunnington multicore machine, whose important characteristics are given earlier in Section 4. Figure 16 gives the improvements (reductions) in execution time under our algorithm (marked as "Balanced" in the graph) and six other strategies. These strategies differ from one another in how they assign weights. The weight assignment strategy our scheme employs has already been explained in Section 8.1. Based on that strategy, we selected a K value of 3 for most of our applications, except for equake and applu, for which values of 5 and 4 are selected, respectively. Also, for each application in our experimental suite, to determine the block size, we


Figure 19: Graphical illustration of Scheme-I through Scheme-V. Circles denote the target slots under consideration. In each figure, the shaded part indicates the region (slots) for which we have non-zero weights.

adopted the approach discussed earlier. The selected computation block size (the number of iterations) turned out to be 2K for applications gafort, apsi, and fma3d, and 4K for the remaining applications.

We can summarize the weight assignment strategies used by the other schemes tested as follows (we use the term "target slot" to refer to a slot for which we are scheduling a computation block, and illustrate the first five schemes graphically in Figure 19):

• Scheme-I: This strategy does not consider inter-core reuse and, consequently, assigns weights only to the slots in the same column as the target slot.

• Scheme-II: This strategy considers only short-term, inter-core reuse; so, only the slots that have already been scheduled in the same row as the target slot are assigned weights.

• Scheme-III: This scheme assigns equal weights to all (already) scheduled slots within the region.

• Scheme-IV: In assigning the weights, this scheme considers only the left and above neighbors of the target slot.

• Scheme-V: This is a combination of Scheme-I and Scheme-II; that is, it considers only the row and the column of the target slot.

• Scheme-VI: This scheme considers all cache layers except for the last one; that is, in Dunnington, it decides the weights by considering only the L1 and L2 layers. Our goal in experimenting with this version is to see how important it is to consider the entire cache hierarchy.
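The Python sketch below contrasts how the first five schemes and our balanced scheme could assign weights to already-scheduled slots around a target slot; the concrete weight values and the distance metric are illustrative assumptions, not the exact π values from Table 5.

# Illustrative weight assignment around a target slot (r, c) in the schedule grid.
# `scheduled` is the set of (row, column) slots already filled; K bounds the range
# considered by the balanced scheme.

def assign_weights(scheme, target, scheduled, K=3):
    r, c = target
    weights = {}
    for (i, j) in scheduled:
        if scheme == "I" and j == c:                 # intra-core only: same column
            weights[(i, j)] = 1.0
        elif scheme == "II" and i == r:              # short-term inter-core: same row
            weights[(i, j)] = 1.0
        elif scheme == "III":                        # equal weight for every scheduled slot
            weights[(i, j)] = 1.0
        elif scheme == "IV" and (i, j) in {(r, c - 1), (r - 1, c)}:  # left and above only
            weights[(i, j)] = 1.0
        elif scheme == "V" and (i == r or j == c):   # union of Scheme-I and Scheme-II
            weights[(i, j)] = 1.0
        elif scheme == "balanced":                   # distance-scaled weights within range K
            d = max(abs(i - r), abs(j - c))
            if d <= K:
                weights[(i, j)] = (K - d + 1) * 0.5  # e.g. 1.5, 1.0, 0.5 for d = 1, 2, 3
    return weights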

The most important observation from Figure 16 is that our scheme generates better results than the other six assignment strategies tested. It is important to note that Scheme-I, in a sense, represents the state-of-the-art data locality optimization for single-core architectures. To better explain why our scheme outperforms Scheme-I and Scheme-II, we present in Figure 17, Figure 18 and Figure 20 the reductions in cache misses at different layers achieved by Scheme-I, Scheme-II and our scheme, respectively. Observe that, as expected, Scheme-I improves L1 performance significantly; however, its improvements in L2 and L3 are not very significant. In contrast, Scheme-II results in better L2 and L3 performance, but its L1 performance is relatively poor. In comparison, our scheme is able to optimize the performance of the caches at all layers. More specifically, it brings 22.1%, 23.1% and 23.7% reductions, on average, in L1, L2 and L3 cache misses, respectively. The corresponding savings with Scheme-I are 26.6%, 10.2% and 9.8%, and those with Scheme-II are 8.8%, 15.7% and 15.5%, in the same order.

Looking at the remaining schemes, as expected, Scheme-III does not perform well at all, since giving equal weights to near and far reuses indiscriminately causes the compiler to miss some important opportunities as far as converting reuse into locality is concerned.



Figure 16: Performance improvements for each application under different strategies (y-axis: improvement in execution cycles; series: Balanced and Scheme-I through Scheme-VI).

Figure 17: Reduction in cache misses at the L1, L2 and L3 layers by Scheme-I, for each of the ten applications.

Figure 18: Reduction in cache misses at the L1, L2 and L3 layers by Scheme-II, for each of the ten applications.

Figure 20: Reduction in cache misses at the L1, L2 and L3 layers by our scheme, for each of the ten applications.

Figure 21: Performance improvements with different values of parameter K (1 through 5).

Figure 22: Performance improvements with different computation block sizes (1KB, 2KB, 4KB, 8KB).

Scheme-IV, on the other hand, performs better than Schemes I, II and III, mainly because it focuses on the closest reuses. However, the scheme that comes closest to ours is Scheme-V, as it is able to optimize reuses in both the horizontal and vertical directions. Finally, Scheme-VI does not perform very well, which indicates that, if we want to maximize locality improvements, all layers of the on-chip cache hierarchy should be considered.

We now study the impact of varying the value of parameter K on application performance. The results with different values of this parameter (from 1 to 5) are presented in Figure 21. We see from these results that, except for two applications (apsi and art), the values selected by our approach generated the best results. Based on these results, we conclude that our approach to selecting the value of the K parameter is very successful in practice.

Our next set of experiments evaluates the impact of computation block sizes on application performance; the results are given in Figure 22. One can observe from these results that the maximum savings are obtained when using the block sizes determined by the strategy explained in Section 8.2, except for art. In this application, a smaller block size performs better, most probably because the amount of data accessed by the computation block size selected by our strategy exceeded the L1 cache capacity, or caused extra conflict misses.

Our final set of results, presented in Figure 23, are collected on the Harpertown machine. We see that, in general, the improvements with this architecture are lower than those obtained with Dunnington. This is mainly because the Dunnington machine has a more complex cache hierarchy than Harpertown, which makes data locality optimization even more critical for achieving maximum performance. The average performance improvement our scheme brings on Harpertown is about 16.3% (compared to 18.8% on Dunnington).

10. RELATED WORK

A data reuse theory for locality optimization was proposed by Wolf and Lam [49] in the context of single-core machines. They classify data reuse into four classes: self-temporal reuse, self-spatial reuse, group-temporal reuse and group-spatial reuse. Wu et al [50] extend this intra-processor classification into eight types by introducing four more types of reuse. Reuse distance, also called LRU stack distance and first mentioned in [35], is a quantitative metric for data reuse in programs. Different reuse distance implementations have been presented in [6, 9, 37, 45] for single-core machines. [43], [24] and [31] propose multicore-aware reuse distance models, which can be used to collect statistics of the type shown in Figure 5 and Figure 6. In comparison, we present a detailed analysis of reuse distances for both inter-core and intra-core data reuses and propose a compiler-based optimization strategy for maximizing the performance of shared caches.
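As a reminder of what the metric measures, the tiny Python sketch below computes reuse (LRU stack) distances over an access trace: the distance of an access is the number of distinct data blocks touched since the previous access to the same block. This is only a textbook illustration, not the implementation used in the cited tools.

# Reuse (LRU stack) distance: for each access, count the distinct blocks touched
# since the previous access to the same block; first-time accesses get infinity.

def reuse_distances(trace):
    last_seen = {}
    distances = []
    for t, block in enumerate(trace):
        if block in last_seen:
            between = set(trace[last_seen[block] + 1 : t])
            distances.append(len(between - {block}))
        else:
            distances.append(float("inf"))
        last_seen[block] = t
    return distances

print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]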

A few data locality optimizations targeting multicores have been proposed in recent years. Chen et al [13] discuss an application mapping strategy for multicores that tries to optimize data accesses from power and performance perspectives. Ascia et al [7] also employ a multi-objective algorithm for application mapping on multicores, but use evolutionary computing techniques. Sarkar and Tullsen present a data-cache aware compilation to find a layout for data objects which minimizes inter-object conflict misses [42]. Kandemir et al [27] discuss a multicore mapping strategy which does not customize the mapping based on the target on-chip cache hierarchy.



Figure 23: Performance improvements for each application under our algorithm and the six other strategies on the Harpertown multicore machine.

Zhang et al [52] evaluate the impact of cache sharing on parallel workloads. Zhang et al [51] study reference affinity and present a heuristic model for data locality. Chishti et al [15] propose a data mapping scheme that utilizes replication and capacity allocation techniques to improve locality. Lu et al [33] develop a compile-time framework for data locality optimization via non-canonical data layout transformation. In [41], the authors solve the allocation and scheduling problems of multicores through integer programming and constraint programming, respectively. Jin et al [25] implement a page-level data to L2 cache slice mapping for multicores. Chu et al [17] target codes with fine-grain parallelism by partitioning loop bodies in a locality-aware fashion. Chou and Marculescu [16] develop a run-time strategy for allocating application tasks to platform resources in a network-on-chip. Cade and Qasem [10] present a model that captures the interaction between data locality and parallelism in the context of pipeline parallelism. These prior data locality optimizations do not consider target multicore cache hierarchies explicitly. Although the work in [28] also takes on-chip cache hierarchies into account, the vertical and horizontal reuses are exploited in separate steps. Compared to [28], our proposed strategy conducts an integrated mapping and scheduling of computation blocks, which maximizes the vertical and horizontal reuses at each scheduling step.

In addition, there have been several hardware-based schemes for shared cache management. Kim et al [29] propose static and dynamic L2 cache partitioning algorithms that optimize fairness while improving throughput. Suh et al [46] measure the cache utility of each application at runtime and dynamically vary the cache space allotted to executing threads. Qureshi and Patt [39] monitor each application at runtime using a hardware circuit and partition a shared cache among threads based on their possible reductions in cache misses. Chang and Sohi [12] use timeslicing as a means of cache partitioning to guarantee cache resources for each application for a certain time quantum. Muralidhara et al [36] develop a dynamic cache partitioning scheme to alleviate intra-application conflicts. Strategies other than cache partitioning have also been explored to decrease interference in shared caches, such as set pinning [44] and thread-aware replacement policies [23]. Modeling of cache sharing among applications includes the work in [11] and [38]. Prior efforts on reducing cache contention at the operating system (OS) level have explored two directions: software-based cache partitioning [40, 47] and thread scheduling [30, 20, 19, 14].

Our compiler-based approach is complementary to most of these efforts. Further, our inter-core reuse distance analysis can be beneficial to some of these hardware- and OS-based schemes.

11. CONCLUDING REMARKS

In this paper, we have made four contributions. First, we have presented an in-depth investigation of the characteristics of inter-core and intra-core data reuses. Our observation is that 1) while intra-core reuse distances are generally short, inter-core reuse distances tend to be much larger, and 2) temporal and spatial inter-core reuses exhibit similar characteristics as far as reuse distances are concerned. Second, we have conducted a comprehensive study on how effective the on-chip cache hierarchies of current multicore architectures and state-of-the-art code/data optimizations are in exploiting the available inter-core and intra-core data reuses in multithreaded applications, considered separately. The conclusion is that neither of them is successful in converting inter-core reuses into data locality. Third, we have demonstrated that, although exploiting all available inter-core reuse can boost overall application performance by around 21.3% on average, trying to optimize for inter-core reuse without considering the impact of doing so on intra-core reuse can actually perform worse than optimizing for intra-core reuse alone. Therefore, as the fourth contribution, we have developed a novel data locality optimization strategy for multicores that balances both inter-core and intra-core reuses. Specifically, our approach implements an integrated mapping and scheduling strategy that maximizes both "vertical" and "horizontal" reuses, considering the on-chip cache hierarchy of the target architecture. Our results collected using ten applications indicate that this unified approach brings about 23.1% and 23.7% improvements in L2 and L3 cache hits, respectively, resulting in an average performance (execution time) improvement of 18.8%. Our results also show that the proposed strategy brings consistent benefits under different values of the major experimental parameters.

12. REFERENCES

[1] AMD's Istanbul six-core Opteron processors. http://techreport.com/articles.x/17005.

[2] IBM Power7. http://en.wikipedia.org/wiki/POWER7.

[3] Intel Core i7 processor. http://www.intel.com/products/processor/corei7/specifications.htm.

[4] Intel Xeon processors. http://en.wikipedia.org/wiki/Xeon.

[5] Platform 2015: Intel processor and platform evolution for the next decade. http://epic.hpi.uni-potsdam.de/pub/Home/TrendsAndConceptsII2010/HW Trends borkar 2015.pdf, 2005.

[6] G. Almasi et al. Calculating stack distances efficiently. SIGPLAN Not., 2003.

[7] G. Ascia et al. Multi-objective mapping for mesh-based NoC architectures. Proc. of CODES+ISSS, 2004.

[8] V. Aslot et al. SPEComp: A new benchmark suite for measuring parallel computer performance. OpenMP Shared Memory Parallel Programming, ISBN 978-3-540-42346-1, 2001.



[9] B. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal of Research and Development, 1975.

[10] M. J. Cade and A. Qasem. Balancing locality and parallelism on shared-cache multi-core systems. Proc. of HPCC, 2009.

[11] D. Chandra et al. Predicting inter-thread cache contention on a chip multi-processor architecture. Proc. of HPCA, 2005.

[12] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. Proc. of ICS, 2007.

[13] G. Chen et al. Application mapping for chip multiprocessors. Proc. of DAC, 2008.

[14] S. Chen et al. Scheduling threads for constructive cache sharing on CMPs. Proc. of SPAA, 2007.

[15] Z. Chishti et al. Optimizing replication, communication, and capacity allocation in CMPs. Proc. of ISCA, 2005.

[16] C. L. Chou and R. Marculescu. User-aware dynamic task allocation in networks-on-chip. Proc. of DATE, 2008.

[17] M. Chu et al. Data access partitioning for fine-grain parallelism on multicore architectures. Proc. of MICRO, 2007.

[18] S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. Proc. of PLDI, 1995.

[19] A. Fedorova. Operating system scheduling for chip multithreaded processors. PhD Thesis, Harvard University, 2006.

[20] A. Fedorova et al. Cache-fair thread scheduling for multicore processors. Technical Report, Harvard University, 2006.

[21] P. P. Gelsinger. Intel architecture press briefing. http://download.intel.com/pressroom/archive/reference/Gelsinger briefing 0308.pdf, 2008.

[22] P. Gepner et al. Second generation quad-core Intel Xeon processors bring 45 nm technology and a new level of performance to HPC applications. Proc. of ICCS, Part I, 2008.

[23] A. Jaleel et al. Adaptive insertion policies for managing shared caches. Proc. of PACT, 2008.

[24] Y. Jiang et al. Is reuse distance applicable to data locality analysis on chip multiprocessors? Proc. of CC, 2010.

[25] L. Jin et al. A flexible data to L2 cache mapping approach for future multicore processors. Proc. of MSPC, 2006.

[26] M. Kandemir. A compiler technique for improving whole-program locality. Proc. of POPL, 2001.

[27] M. Kandemir et al. Optimizing shared cache behavior of chip multiprocessors. Proc. of MICRO, 2009.

[28] M. Kandemir et al. Cache topology aware computation mapping for multicores. Proc. of PLDI, 2010.

[29] S. Kim et al. Fair cache sharing and partitioning in a chip multiprocessor architecture. Proc. of PACT, 2004.

[30] R. Knauerhase et al. Using OS observations to improve performance in multicore systems. IEEE Micro, 2008.

[31] M. Kulkarni et al. Accelerating multicore reuse distance analysis with sampling and parallelization. Proc. of PACT, 2010.

[32] W. Li. Compiling for NUMA parallel machines. Doctoral Dissertation, Cornell University, 1993.

[33] A. Lu et al. Data layout transformation for enhancing data locality on NUCA chip multiprocessors. Proc. of PACT, 2009.

[34] M. M. K. Martin et al. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput. Archit. News, 2005.

[35] R. Mattson et al. Evaluation techniques for storage hierarchies. IBM Systems Journal, 1970.

[36] S. Muralidhara et al. Intra-application shared cache partitioning for multithreaded applications. Proc. of PPoPP, 2010.

[37] F. Olken. Efficient methods for calculating the success function of fixed space replacement policies. Technical Report, Lawrence Berkeley Laboratory, 1981.

[38] P. Petoumenos et al. Modeling cache sharing on chip multiprocessor architectures. Proc. of IEEE International Symposium on Workload Characterization, 2006.

[39] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches. Proc. of MICRO, 2006.

[40] N. Rafique et al. Architectural support for operating system-driven CMP cache management. Proc. of PACT, 2006.

[41] M. Ruggiero et al. Communication-aware allocation and scheduling framework for stream-oriented multi-processor systems-on-chip. Proc. of DATE, 2006.

[42] S. Sarkar and D. M. Tullsen. Compiler techniques for reducing data cache miss rate on a multithreaded architecture. Proc. of HiPEAC, 2008.

[43] D. Schuff et al. Multicore-aware reuse distance analysis. Workshop on Performance Modeling, Evaluation, and Optimisation of Ubiquitous Computing and Networked Systems, 2010.

[44] S. Srikantaiah et al. Adaptive set pinning: Managing shared caches in chip multiprocessors. Proc. of ASPLOS, 2008.

[45] R. Sugumar and S. Abraham. Multi-configuration simulation algorithms for the evaluation of computer architecture designs. Technical Report, University of Michigan, 1993.

[46] G. E. Suh et al. Dynamic partitioning of shared cache memory. Journal of Supercomputing, 2004.

[47] D. Tam et al. Managing shared L2 caches on multicore systems in software. Proc. of WIOSCA, 2007.

[48] R. Wilson et al. The SUIF compiler system: a parallelizing and optimizing research compiler. Technical Report, Stanford University, 1994.

[49] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. Proc. of PLDI, 1991.

[50] J. Wu et al. Parallel data reuse theory for OpenMP applications. Proc. of SNPD, 2009.

[51] C. Zhang et al. A hierarchical model of data locality. Proc. of POPL, 2006.

[52] E. Zhang et al. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? Proc. of PPoPP, 2010.
