
A Practical Data Classification Framework for Scalable and High Performance Chip-Multiprocessors


Yong Li, Student Member, IEEE, Rami Melhem, Fellow, IEEE, and Alex K. Jones, Senior Member, IEEE

Abstract—State-of-the-art chip multiprocessor (CMP) proposals emphasize general optimizations designed to deliver computing power for many types of applications. Potentially significant performance improvements that leverage application-specific characteristics such as data access behavior are missed by this approach. In this paper, we demonstrate how scalable and high-performance parallel systems can be built by classifying data accesses into different categories and treating them differently. We develop a novel compiler-based approach to speculatively detect a data classification termed practically private, which we demonstrate is ubiquitous in a wide range of parallel applications. Leveraging this classification provides efficient solutions to mitigate data access latency and coherence overhead in today's many-core architectures. While the proposed data classification scheme can be applied to many micro-architectural constructs including the TLB, coherence directory, and interconnect, we demonstrate its potential through an efficient cache coherence design. Specifically, we show that the compiler-assisted mechanism reduces coherence traffic by an average of 46% and achieves up to 12%, 8%, and 5% performance improvement over shared, private, and state-of-the-art NUCA-based caching, respectively, depending on the scenario.

Index Terms—Practically private, data classification, pipelined parallel, multi-threaded parallel, OpenMP, compilers, cache coherence

1 INTRODUCTION

ALTHOUGH chip multiprocessors (CMPs) have become a popular class of daily computing platforms due to their thermally aware approach to leveraging increasing numbers of transistors to improve performance, this performance potential is challenged by a number of limiting factors. These factors include remote data access latency, coherence overhead, inter-core communication penalty, etc. While many effective improvements [1]–[6] to CMPs have been proposed to address these issues, many optimization opportunities are lost unless specific characteristics of the application(s) running on the system can be leveraged. In particular, the lack of information about characteristics of a running application, such as data access patterns, communication behavior, and memory usage, typically results in a more generic but less efficient method for utilizing on-chip caches, interconnects and coherence directories. Thus, it is critical to design architectures that are aware of application behavior, as proposed recently in the research literature [7]–[10].

To determine relevant application behavior, we have studied various parallel benchmarks and found that many multi-threaded applications, such as those from the SPLASH-2 [11], PARSEC [12] and RODINIA [13] benchmark suites, exhibit regular data access behavior that can be discovered statically. For example, in these benchmarks stack accesses are predominantly local/private and data in the text segment can be largely classified as shared. Static global and dynamically allocated objects are typically an application-dependent mixture of private and shared data. However, this mix can often be understood by studying how and where data objects are declared/allocated and used in the source code. For example, data-parallel applications typically allocate objects dynamically within a particular working thread, and these objects are accessed by that thread with minimal sharing with other threads.

In this paper we present compiler analysis techniques to distinguish between private and shared style data accesses. To reduce the compilation complexity and avoid NP-complete compiler analyses, we introduce a new data classification called practically private. We call data practically private when the compiler cannot easily prove that the memory is accessed privately (e.g., exclusively by one thread/core) but speculatively determines that private access is highly probable and any sharing is minimal. Our data classification promotes access proximity for practically private data, which would in fact be treated as shared in many runtime schemes. Moreover, the proposed classification is helpful in designing a more efficient coherence protocol that distinguishes practically private versus shared data, which have remarkably distinct sharing and coherence behaviors. We demonstrate that practically private data is ubiquitous across a variety of applications and that a high percentage of practically private data is dominated by local accesses. Finally, we provide a case study of a system that utilizes the page table as a conduit to communicate application data classification information to the runtime system and a cache coherence protocol that leverages this

• Y. Li and A.K. Jones are with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261. E-mail: [email protected]; [email protected].

• R. Melhem is with the Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260. E-mail: [email protected].

Manuscript received 10 Nov. 2012; revised 22 July 2013; accepted 29 July 2013. Date of publication 14 Aug. 2013; date of current version 12 Nov. 2014. Recommended for acceptance by J. Xue.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2013.161


0018-9340 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

information to reduce coherence traffic and improve performance. In particular, this paper makes the following new contributions compared to prior related work:

• We introduce practically private, a novel data classification that represents an important category of data access behavior for a variety of multi-threaded applications. Our experiments on benchmarks with different parallel models in various domains show that a high percentage of the data accesses are practically private.
• We point out several important issues of the state-of-the-art runtime data classification scheme and illustrate how the proposed concept of practically private can improve the efficiency and effectiveness over existing data classification approaches.
• We present a compiler methodology and implementation based on the SUIF Sharlit framework [14] to perform the data classification. Using profiling to verify the compiler-based data classification, we demonstrate that more than 83% of practically private data blocks are actually private and up to an additional 20% are mostly private (shared infrequently, with only one other core).
• We describe a customized memory allocator to pass the compiler-identified classification information to the underlying architecture. Optimizations are applied to the allocator to address a false sharing problem and reduce coherence overhead for private memory blocks.
• Finally, we developed an efficient cache organization and specialized coherence protocol that demonstrates the usage of the proposed data classification for CMP performance improvement. Our data classification aware cache reduces coherence traffic by 46%. It achieves a 12% performance improvement over a static non-uniform access cache (S-NUCA) in which data is placed in an interleaved manner determined by the data's physical address, an 8% improvement over private caches, and a 5% improvement over the state-of-the-art runtime classification-based cache, depending on the scenario.

The rest of the paper is organized as follows: Section 2 describes related efforts for utilizing data classification or access pattern analysis to optimize CMP performance. Section 3 introduces our compiler analysis technique for data classification and identification. In Section 4, we describe a case study system in which the compiler analysis is utilized. We examine the effectiveness of the compiler for identifying private, practically private and shared data and demonstrate the performance gain of our compiler-assisted case study system in Section 5. Finally, we draw some conclusions and describe ongoing and future work directions of this effort in Section 6.

2 RELATED WORK

A number of prior efforts have attempted to utilize application data access characteristics to optimize different hardware components of CMPs such as caches, interconnects, coherence directories, etc.

In reactive NUCA (R-NUCA) [7], data accesses are classified as private, shared, and read-only at the page granularity. Data is assumed to be private until a second core accesses the data (signaled by a translation look-aside buffer (TLB) miss). Data lines from private pages are cached locally to improve access latency, while lines from shared pages are cached S-NUCA style [15] (i.e., distributed shared data placement) to improve capacity.

Another relevant effort is Jin and Cho's software oriented shared (SOS) cache management [8]. They classify data accesses to a range of memory locations (returned by a memory allocation function such as malloc()) into several categories such as Even Partition, Scattered, Dominant Owner, Small-Entity, and Shared. Applications are profiled and memory accesses are matched to one of the above categories. Hints are provided to the memory allocation functions based on the matched access patterns. Pages within the memory ranges are placed into cache tiles indicated by these hints.

R-NUCA and SOS are techniques that leverage execution history and profiling to detect the data access pattern. In addition, they work at the operating system (OS) page granularity, which has a higher probability of one data access pattern being polluted by another compared with a finer granularity structure such as a cache line or data-structure element.

Cuesta et al. [9] recently presented an efficient cache coherence directory based on a runtime data classification scheme similar to the one used in R-NUCA. The proposed scheme saves 57% of the coherence directory entries by distinguishing shared data blocks from private ones and maintaining directory entries for only the shared data blocks. When the classification of a data page experiences a transition from private to shared, as detected by the OS, a coherence recovery process is invoked to recover the coherence states.

Some application behavior has a great impact on the interconnect, as explored in Shao et al.'s work [10]. In this work the communication pattern of message passing applications is extracted statically to guide an optical circuit-switching interconnect configuration to minimize runtime circuit establishment overhead. TLBs can also benefit from application data behavior to provide fast virtual to physical address translation [16].

In the above research efforts, data classification and pattern detection are applied to improve system performance and efficiency in many scenarios. In this paper, we present a lightweight compile-time data access analysis that utilizes a novel data classification of practically private data. We do not rely on expensive and conservative analysis techniques with limited practical application, and we avoid the need for running and profiling programs, which can produce misleading results due to the training workload. Additionally, by completing the analysis at compile time, we can avoid data classification pollution due to the relatively coarse granularity of pages by guiding the memory allocator to group data classifications together within pages. Our proposed data classification technique can be applied to improve the performance of different on-chip components including caches, TLBs, directories, interconnects, etc.

3 DATA CLASSIFICATION

A study of multi-threaded code from a variety of program domains such as scientific computing, multimedia, image processing and financial processing reveals that data structures can be used in quite different ways. We also observe that the way data is usually used by multiple threads can be implied by information such as where in the virtual memory space the data is allocated and how the references of data are


handled. For example, instructions and globally allocated data such as synchronization structures are typically shared by all threads. In contrast, stack and heap objects allocated within a thread usually have very few sharers. From the system performance point of view, the access characteristics of these data objects are so different that each of them should be treated using a customized design. This section describes our approach to classify and identify the data classification at compile time. Since a variety of multi-threaded benchmarks feature extensive usage of dynamic memory allocation for managing computed data, we concentrate on analyzing data blocks allocated through memory allocators such as malloc(). The analysis approach can be extended to other memory allocation routines such as new.

3.1 Motivation: The Concept of Practically Private

Understanding the data access behavior in multi-threaded programs is essential to deliver high performance on CMPs. Parallel applications tend to exhibit flexible and diverse data access patterns. This poses a challenge for the compiler to detect and describe these patterns. However, our study shows that there are several representative patterns/classifications that exist in a variety of parallel applications, and they dominate the entire program execution. Consider the matrix multiplication example in Fig. 1. The code in Fig. 1(a) computes the product of a matrix and a vector and stores the result in an output vector. Fig. 1(b) shows a typical parallelized version of the same problem using POSIX threads, assuming the number of available threads is 4. Both the source code and the illustration in Fig. 1(b) imply a clear data classification: the matrix and the output vector are partitioned across the 4 threads and thus can be classified as private, while the input vector should be classified as shared since it is entirely shared by all threads. We will revisit this example in more detail in Section 3.2 and elaborate on how such a data classification can be formally identified from source code.
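
To make the pattern concrete, the sketch below shows the structure that Fig. 1(b) depicts: a POSIX-threads matrix-vector product in C. The names (A, x, y, worker, NUM_THREADS) are illustrative placeholders rather than the paper's actual code, and N is assumed to be divisible by the thread count.

    #include <pthread.h>

    #define N 1024
    #define NUM_THREADS 4

    static double A[N][N];  /* partitioned by rows across threads   */
    static double x[N];     /* read in its entirety by every thread */
    static double y[N];     /* partitioned by rows across threads   */

    static void *worker(void *arg) {
        long tid = (long)arg;                  /* unique per thread: a TI variable */
        long start = tid * (N / NUM_THREADS);  /* derived from tid: also TI        */
        long end = start + (N / NUM_THREADS);
        for (long i = start; i < end; i++) {   /* TI-bounded loop: index i is TI   */
            y[i] = 0.0;
            for (long j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];        /* A, y indexed by TI variable i    */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Because A and y are only ever dereferenced with the TI-derived index i, each thread touches a disjoint row range; x, by contrast, is read in full by every thread.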

Generally, the classification of data must meet certain criteria to improve data access latency, save coherence directory resources, or reduce network traffic. In particular:

• Different classifications within an application exhibit remarkable disparity in terms of locality, storage requirements, and access latency, and as such must be treated differently.
• Classifications are practical for the compiler to identify.
• Classifications are representative of a wide spectrum of parallel applications.

A straightforward method is to classify data blocks into two distinct access categories, private versus shared, as shown in the above example. Private data is accessed by only one processor and thus is suitable to be placed locally to reduce access latency and promote locality. Coherence directory size can also be reduced by eliminating entries for private data [9]. Conversely, shared data is accessed by more than one processor and should be optimized using different methods than private data. Shared data can be placed at a fixed location indexed by its address or at the "center of gravity" of its requesters [17] to reduce coherence traffic, save directory entries and simplify searching of data, especially when the data exhibits frequent/heavy sharing.

As a compiler-assisted data classification approach, our analysis must remain conservative when identifying private data to guarantee correctness. Unfortunately, identifying data privacy requires complicated and NP-complete compiler analyses (e.g., data reuse analysis, inter-procedural analysis, data disambiguation, etc.), which drastically increase the compilation overhead and may still fail to guarantee data privacy in some complicated cases such as calling procedures through function pointers and accessing memory through pointer arithmetic. To avoid these complications and expensive analyses, we extend our classification to include a third category: practically private data. Our classification is described as follows:

• Private: In multi-threaded applications, a data block (returned by malloc()) is defined to be private if every element in it is accessed by only one thread in the parallel program section. This is the case when multiple threads in the program partition a data block exclusively without overlap.
• Practically Private: For data that appears to be private but is not provably so from our compiler analysis, we call this data practically private. There are two scenarios that this classification covers, and the same analysis discovers and optimizes our system for both. First, data that is probably private can practically be treated as private. Probably private data is frequently entirely private at runtime, as depicted in Fig. 2(a), but we must take safeguards to deal with cases when sharing occurs. Second, if the practically private data is not entirely private, typically it is still mostly private, as illustrated in Fig. 2(b) and (c). Fig. 2(b) indicates spatial mostly private data, where each thread operates on largely exclusive data regions with shared boundaries. Fig. 2(c) illustrates temporal mostly private data, on which multiple threads operate exclusively most of the time (e.g., data is shared only at the beginning or the end when threads are forked/joined). Typically, mostly private data has a low degree of sharing in terms of sharers and frequency of sharing, making it suitable for private treatment in practice.

Fig. 2. Different scenarios for practically private data: (a) practically private - actually private, (b) practically private - mostly private (spatial), and (c) practically private - mostly private (temporal).

Fig. 1. Serial and parallel computation for matrix multiplication.


• Shared: This is the default classification, used when a data block can be classified as neither private nor practically private.

By its definition, practically private can be regarded as a relaxed version of the private data classification in prior work, easing the compiler analysis and reducing the classification detection overhead. Additionally, the concept of practically private addresses several issues raised by the previously proposed data classifications [7], [9], which have a relatively rigid definition of private versus shared. First, the runtime classification scheme suffers from a "data pollution" problem in which even a single shared element in a page leads to the whole page being classified as shared. This reduces the effectiveness of the data classification and can degrade performance for applications with boundary sharing (e.g., OCEAN, WATER, etc.). Another common case where the runtime data classification does not perform well is when a data page is initialized by one thread but heavily accessed by another thread. The runtime scheme misclassifies the page as shared while private accesses are predominant. All of the above issues can be addressed by leveraging the concept of practically private.

3.2 Data Classification Detection

In order to detect the data classification for dynamic memory allocations, we use a structure called a reference list to keep track of all the pointers that may point to a particular memory block. Initially, a reference list is created at each call site of malloc() and the returned address is added into the reference list. Reference list updates utilize a data flow analysis that traverses the CDFG (control and data flow graph) of the analyzed program as follows. Let s be a statement node in the CDFG, Succ(s) be the list of the immediate successor nodes of s, Pred(s) be the list of the immediate predecessor nodes of s, IN(s) be the reference list state before executing s, and OUT(s) be the state after executing s. Each statement s in the program has two effects on IN(s) and OUT(s): Gen(s) and Kill(s). Gen(s) creates a new reference list or adds a pointer to an existing reference list, depending on the form of s. Kill(s) removes a reference list or a pointer within it. For example, a statement p = malloc(...) has the Gen effect of creating a reference list with p in it and the Kill effect of removing p from its current reference list. Likewise, a pointer assignment p = q generates p into the reference list that contains q and kills p from its current reference list.

The data flow equations for updating the reference lists can be derived from the above notions:

    IN(s)  = ∪_{d ∈ Pred(s)} OUT(d)            (1)
    OUT(s) = Gen(s) ∪ (IN(s) − Kill(s))        (2)

To illustrate the operation of this analysis, consider the example shown in Fig. 3. The sample code on the left allocates data using malloc() and accesses the allocated data after multiple threads are forked. The analysis begins by constructing a control flow graph (CFG). As the CFG is being traversed, malloc() routines and pointer assignments are detected. Using the Gen and Kill functions and the data flow equations, reference lists are created and updated, as shown on the right hand side of Fig. 3. Initially, the reference list set is empty. At the first malloc() call site, the pointer receiving the returned address is added into a new reference list as a pure pointer. The next assignment adds a second pointer into the reference list that contains the first. When the first pointer is reassigned by a second malloc(), it is removed from the original reference list by Kill, and a new reference list is created to which it is added, together with the pointer assigned in the succeeding statement. When the conditional branch is encountered, the pointer assigned on both paths is added into both reference lists.
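
The fragment below sketches the shape of code that Fig. 3 analyzes, with the reference-list states that the Gen/Kill rules and Eqs. (1) and (2) would compute shown as comments. The pointer names p, q, r and the branch condition are illustrative, since the figure itself is not reproduced here.

    #include <stdlib.h>

    void example(int cond) {
        int *p, *q, *r;

        p = malloc(64);  /* Gen: create reference list L1 = {p}                */
        q = p;           /* Gen: add q to the list containing p -> L1 = {p, q} */
        p = malloc(64);  /* Kill: remove p from L1 -> {q};
                            Gen: create a new list L2 = {p}                    */
        r = p;           /* Gen: add r to L2 -> {p, r}                         */
        if (cond)
            q = p;       /* on this path, Kill removes q from L1 and Gen
                            adds it to L2                                      */

        /* Join point after the branch: Eq. (1) unions the two path states,
           so q ends up in both L1 and L2 -- the conservative result the
           text above describes. (Deallocation omitted for brevity.) */
        (void)q; (void)r;
    }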

The classification of a data block is strongly implied by its pointer dereference features. One such important feature for the compiler to discover is pointer dereference (array access) with Thread-Identifying Variables [18]. Thread-Identifying variables (TI variables) are thread-local variables which have unique values for different threads of execution. Typically, these variables are used to determine which memory blocks, and in some cases which portion of a memory block, the thread will access. The compiler must identify which variables in the program are TI variables in order to determine how the pointers in the reference lists are dereferenced and used by each of the threads.

In multi-threaded applications, one common method to assign values to TI variables is to pass different values to each of the parallel threads as function arguments during thread creation, as shown in Fig. 4(a).

Another method for populating TI variables is when multiple threads access and modify a global variable under the protection of a mutex, as illustrated in Fig. 4(b). Although this specific form of value modification can be easily detected, this type of code is flexible in general and can sometimes be difficult for a compiler to analyze. If the code involves complicated methods to specify TI variables, we require the user to include a directive to assist the compiler in determining TI variables for analysis. In the example, #pragma TIV pid specifies pid as a TI variable.
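
The sketch below illustrates both idioms of Fig. 4 in one C translation unit. All names are illustrative, and the #pragma TIV directive is the user-assist form described above, not a standard C pragma.

    #include <pthread.h>

    #define NUM_THREADS 4
    #define CHUNK 256

    /* (b) A shared counter handed out under a mutex: every thread reads
       a distinct value, so 'pid' is thread-identifying (cf. Fig. 4(b)). */
    static int next_id = 0;
    static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        /* (a) A distinct value passed at thread creation: 'tid' is a
           basic TI variable (cf. Fig. 4(a)). */
        long tid = (long)arg;
        long offset = tid * CHUNK;  /* derived from tid: also TI */

    #pragma TIV pid                 /* directive form from Section 3.2 */
        int pid;
        pthread_mutex_lock(&id_lock);
        pid = next_id++;            /* unique per thread: 'pid' is TI */
        pthread_mutex_unlock(&id_lock);

        (void)offset;
        (void)pid;
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }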

Note that variables derived from other TI variables, either through calculation or assignment, are also TI variables. Once the basic TI variables are identified using the above methods, all other TI variables can be detected by performing a data flow analysis similar to the reaching definitions problem [19].

Given the notions of reference lists and TI variables, we introduce the data classifying rules as follows:

• Determine Private Data: Data can be assured to be private if all the pointers in the associated reference list are of local scope (i.e., pointers are never passed to global pointers or other threads).

Fig. 3. Pointer analyses example for data classification.


• Determine Practically Private Data: Data on the stack is classified into this category. Heap data blocks are also identified as practically private if at least one pointer in the reference list is of global scope (i.e., the pointer is global or passed to other threads) and that pointer is dereferenced with a TI variable.1 Typically, this indicates that these data blocks are probably private. In many embarrassingly parallel applications, probably private data is actually private when the global data blocks are exclusively partitioned among multiple threads by TI variables. Even if the practically private data blocks are not actually private, it is likely that they are mostly private, as is the case in many particle interaction simulation programs where a small amount of data is shared among neighboring processing nodes (please refer to Fig. 2).
• Determine Shared Data: Data that cannot be identified as belonging to the above two categories is classified as shared.

As we can see from the above rules, detecting private data is conservative, since any pointer in the reference list that implies a shared or practically private classification will overwrite those that imply a private classification. Determining practically private data, however, is aggressive. The data classifying rules are designed as such to maximally expose optimization opportunities while guaranteeing correctness. This is also revealed in Algorithm 1, as will be described in Section 3.5.

Given the data classification methodology described so far, let us now revisit the data-parallel matrix multiplication example shown in Fig. 1(b). Without loss of generality, we shall assume that all the matrices are globally declared (we cannot identify them as private). We first note that the thread ID variable is a TI variable. Since the bounds of each thread's row range are expressions of the thread ID, they are also identified as TI variables using forward data flow analysis. These bounds in turn serve as loop bounds, making the loop index a TI variable. Since the pointers to the matrix and the output vector are dereferenced with this index, both are classified as practically private based on the data classification rules introduced above. Similarly, the input vector is classified as shared. In this example, we can see that our data classification scheme does not identify the matrix and the output vector as private data. This conservatism is to guarantee correctness in the presence of complicated pointer usage, which makes global data difficult to be assured private. Furthermore, we expect that treating private data as practically private will not significantly degrade performance.2

3.3 Applicability to Programs with Dynamic Parallelism

Unlike data-parallel programs (e.g., the parallel matrix multiplication), which statically partition workloads using thread IDs, applications with dynamic parallelism operate on their input by coordinating multiple threads based on certain scheduling policies. For example, the PARSEC benchmark X264 processes input video frames in a pipeline with a number of stages equal to the number of encoder threads. The threads communicate with each other to resolve inter-frame dependences and assign new frames to idle threads. Another example is the file compressor PBZIP, which utilizes a producer-consumer parallel model to handle the input files dynamically. Although these applications represent a different paradigm of parallel data access pattern, carefully studying them reveals that the concept of practically private still applies and the techniques introduced in Section 3.2 can be used to identify the practically private data classification in these programs.

Compared to static data-parallel applications, which typically utilize thread IDs or other similar scalar variables to partition data, dynamic parallel programs often use TI pointers3 to coordinate the data accesses among threads. TI pointers can be identified by extending the principle described in Fig. 4 to pointers. Specifically, a shared pointer can be claimed as a TI pointer if its value is modified in a region guarded by a mutex lock. Based on the notion of TI pointers, we augment the classification rule for determining practically private data with the condition where a TI pointer is present:

• Determine Practically Private Data: A heap data block is classified as practically private if at least one pointer in the reference list is of global scope and that pointer is a TI pointer or is dereferenced with a TI variable.

Fig. 5 depicts a typical program with producer and consumer threads sharing a data queue. The producer examines the queue and puts initialized data at the tail of the queue as long as the queue is not full. On the other hand, several consumer threads dynamically operate on the initialized data blocks from the head of the queue based on the current

Fig. 5. Practically private data in a program applying the producer-consumer parallel model.

Fig. 4. Identifying TI variables: (a) identifying TI variables passed as parameters and (b) identifying TI variables using directives.

1. We slightly modify this condition to determine practically private data for dynamic parallel programs, as detailed in Section 3.3.

2. In a directory based private cache organization, the only difference between private and practically private data is that private data allows a memory request to go directly to main memory upon a local last-level cache miss, thus bypassing the coherence directory query.

3. A TI pointer is a pointer that has the property of a TI variable. In other words, multiple instances of a TI pointer may point to different memory addresses in different threads.


progress and the work remaining. Due to the head and tail updates in the atomic region (the code between pthread_mutex_lock and pthread_mutex_unlock), the memory blocks reached through the queue's head pointer are identified as practically private. In reality, the data queue in Fig. 5 is only initialized by the producer thread and mostly accessed by one of the consumer threads, leading to the temporal private accesses illustrated in Fig. 2(c). Although for simplicity we only illustrate the example of one producer, our analysis of the practically private concept applies to the case where multiple producers exist. In that scenario, multiple producers individually generate and initialize work for some consumer threads. The temporal privacy, and thus the practically private property, of a data block still remains from the perspectives of the consumer consuming the block and the producer that generated the block.
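
The following self-contained C sketch shows the producer-consumer pattern of Fig. 5, with the mutex-guarded head and tail updates that let the analysis treat the dequeued pointer as a TI pointer. The queue size, item layout, and names are illustrative assumptions, not the paper's code.

    #include <pthread.h>
    #include <stdlib.h>

    #define QSIZE 64
    #define NITEMS 256
    #define NCONSUMERS 3

    typedef struct { double payload[8]; } item_t;

    static item_t *queue[QSIZE];
    static int head = 0, tail = 0, consumed = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        int n = 0;
        while (n < NITEMS) {
            pthread_mutex_lock(&qlock);
            if ((tail + 1) % QSIZE != head) {    /* queue not full */
                item_t *it = malloc(sizeof *it); /* initialized by the producer */
                for (int i = 0; i < 8; i++)
                    it->payload[i] = n;
                queue[tail] = it;
                tail = (tail + 1) % QSIZE;       /* 'tail' changes only under the lock */
                n++;
            }
            pthread_mutex_unlock(&qlock);
        }
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        for (;;) {
            item_t *work = NULL;
            int done = 0;
            pthread_mutex_lock(&qlock);
            if (head != tail) {                  /* 'head' changes only under the
                                                    lock: queue[head] acts as a
                                                    TI pointer                   */
                work = queue[head];
                head = (head + 1) % QSIZE;
                consumed++;
            } else if (consumed == NITEMS) {
                done = 1;
            }
            pthread_mutex_unlock(&qlock);
            if (done)
                break;
            if (work) {
                /* Touched only by the producer (initialization) and this one
                   consumer: temporally mostly private, as in Fig. 2(c). */
                for (int i = 0; i < 8; i++)
                    work->payload[i] *= 2.0;
                free(work);
            }
        }
        return NULL;
    }

    int main(void) {
        pthread_t p, c[NCONSUMERS];
        pthread_create(&p, NULL, producer, NULL);
        for (int i = 0; i < NCONSUMERS; i++)
            pthread_create(&c[i], NULL, consumer, NULL);
        pthread_join(p, NULL);
        for (int i = 0; i < NCONSUMERS; i++)
            pthread_join(c[i], NULL);
        return 0;
    }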

Fig. 6 shows a pipelined parallel program with three stages and two queues (q1 and q2), each of which is shared between two adjacent stages. Each queue stores data from the prior stage and provides workload for the next. In this example, the data in the queues is accessed by exactly two threads and exhibits high temporal locality, which can be discovered by our TI pointer based approach.

3.4 Data Classification for OpenMP

The concept of practically private data applies to many parallel programming models including OpenMP. For example, with a specifically extended method to identify TI variables, the data classification approach introduced in Section 3.2 can be utilized to detect practically private data in OpenMP programs.

In OpenMP programs, one requirement of a TI variable is that it has to be thread-private (either declared private by default or explicitly specified by the OpenMP clauses PRIVATE, FIRSTPRIVATE or LASTPRIVATE). A thread-private variable becomes a TI variable if multiple instances of the variable derive different values through mechanisms such as OMP_GET_THREAD_NUM() and OpenMP locks (OMP_SET_LOCK() and OMP_UNSET_LOCK()). Note that one special type of TI variable in OpenMP is the index variable of a loop following the directive #pragma omp parallel for. Once TI variables are detected, the techniques introduced in Sections 3.2 and 3.3 can be used to recognize practically private data. The other two categories in our data classification, namely private and shared, can be identified using the same approach introduced in Section 3.2.
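
As a concrete illustration, the short OpenMP fragment below shows both forms of TI variable just described. It is a sketch, not code from the paper: N is assumed divisible by the thread count, and the standard lower-case spellings omp_get_thread_num() and omp_get_num_threads() are used.

    #include <omp.h>

    #define N 1024
    static double a[N], b[N];

    void scale_and_shift(void) {
        /* The index of an "omp parallel for" loop is implicitly thread-
           private and partitions the iterations: a special TI variable. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];   /* a, b dereferenced with the TI index:
                                    practically private by the Section 3.2
                                    rules                                  */

        /* A thread-private variable deriving a unique value from
           omp_get_thread_num() is likewise a TI variable. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();         /* basic TI variable */
            int chunk = N / omp_get_num_threads();  /* derived: also TI  */
            for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
                a[i] += 1.0;                        /* TI-indexed access */
        }
    }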

It is worth mentioning that the PRIVATE clause in the OpenMP specification does not imply a private data classification in our terminology. This seeming inconsistency stems from the semantic meaning of the OpenMP clause PRIVATE, which only dictates whether multiple instances of a variable should be created, one for each thread. Additionally, data scope attribute clauses in OpenMP are restricted to scalar variables, typically pointers, not arrays or objects. In this sense, specifying the scope of a pointer does not affect how many threads can access the data reachable through the pointer, and thus does not affect its data classification. Therefore, the data classification approach introduced above is still necessary and valuable for OpenMP programs.

3.5 Implementation

Algorithm 1. Data classification algorithm on graph G.

Input: G = (N, E, n0)
Output: G with malloc() function calls annotated with data classification

    begin
        for each node n ∈ N do
            if n is an array access or pointer dereference then
                b ← array base;
                x ← array offset;
                for each pointer A ∈ RL(b) do
                    if A's block has been classified as shared then
                        if A is global && x is TIV then
                            re-classify as practically private;
                        else
                            do nothing (stay shared);
                    else if A's block has been classified as private then
                        if A is global && x is not TIV then
                            re-classify as shared;
                        else if A is global && x is TIV then
                            re-classify as practically private;
                        else /* A is local */
                            do nothing (stay private);
                    else /* A's block has not been classified yet */
                        if A is global && x is not TIV then
                            classify as shared;
                        else if A is global && x is TIV then
                            classify as practically private;
                        else /* A is local */
                            classify as private;
            else
                continue;
        mark unclassified memory allocations as shared
    end

Fig. 6. Practically private data in a program applying the pipeline parallel model.


We use SUIF [20] to implement the necessary compiler analyses for the proposed data classification. SUIF provides a framework named Sharlit [14] to facilitate the implementation of data flow analyses. Sharlit first constructs a reduced flow graph for the source program based on an extension of Tarjan's fast path algorithm [21]. It then uses an iterator to traverse the reduced graph, calling user specified flow functions (e.g., the Kill and Gen functions described in Section 3.2) and applying meet rules (e.g., Eqs. (1) and (2)) at path joins, until solutions (e.g., reference lists and TI variables) are found. After the iterative process is complete, reference lists are attached to the nodes where pointers are dereferenced (e.g., array accesses). In addition, every variable is identified either as a TI or non-TI variable. The flow graph is then traversed in another pass, during which an action routine that checks the data classification rules at the array access points, such as those in Fig. 3, is called at every node. The algorithm for the action routine is summarized in Algorithm 1, assuming G = (N, E, n0) is a directed graph with nodes N, edges E, and an entry node n0, and RL(p) represents the reference list elements that are associated with pointer p.

4 CASE STUDY: A DATA CLASSIFICATION AWARE CMP ARCHITECTURE

The performance of CMPs is largely limited by the latency of data accesses, which is highly dependent on the organization of the memory caches connected using the on-chip interconnect. As the number of cores in CMP systems increases, the latency of the interconnect is becoming an even greater bottleneck. As a result, elimination of remote data accesses and localization of communication have been demonstrated to be crucial to performance improvement [22]. A number of architectural techniques have been proposed to achieve this goal [23]–[27]. In general, these techniques aim at a compromise between the two basic cache organizations, namely the static non-uniform cache architecture (S-NUCA) [15]4 and the per-core private cache architecture [28]. The goal is promoting data proximity while efficiently utilizing the entire cache capacity with minimal coherence overhead. In this section we provide a case study of a CMP architecture that leverages the proposed data classification technique and utilizes it to optimize the design of the on-chip last level caches to achieve this goal.

4.1 Customized Memory Allocator

For the underlying architecture to be data-classification-aware, we employ a modified memory allocator to store and

pass the compiler instrumented classification information to the runtime system. Existing memory allocators such as malloc() obtain the starting address of the heap from the operating system (OS) for the requested data block and maintain a list to keep track of allocated as well as free memory blocks in the virtual heap space. Each newly allocated block is filled with meta information within its header, including the block size (S) and allocation status (A) of the block, as illustrated in Fig. 7. The memory allocator also optimizes the allocated blocks by reducing memory fragmentation through mechanisms such as eight-byte alignment, splitting and block coalescing, etc.

A similar mechanism can be used to provide support for efficient sharing of information between the compiler and architecture. To inform the hardware of the data classifications identified by the compiler, we extend the prototype of void *malloc(size_t size) to void *malloc(size_t size, classification_t cls). The size parameter is retained from the original version of malloc() and is the size of the requested data block. The second parameter cls is automatically filled by the compiler with the identified data classification. Upon an allocation, the invoked memory allocator adds the data classification information to the header, now including a 2-bit C field, of the newly allocated memory block in addition to the original meta information. To facilitate runtime utilization of the classification information at a page granularity, the modified memory allocator aggregates blocks with the same classification into the same pages, as illustrated in Fig. 7. In other words, data blocks with different classifications are not permitted to be allocated within the same page. Each page table entry (and the corresponding TLB entry) is also augmented with the data classification information. During the virtual-to-physical address translation, the classification information in the page table entry is retrieved with the physical address and can be utilized by the runtime system.
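
The paper specifies the extended prototype and the header fields (S, A, and the new 2-bit C field) but not the allocator internals, so the following is a minimal C sketch assuming one bump-allocated arena per classification, which guarantees that blocks of different classes never share a page. A production allocator would add page-boundary alignment, splitting, coalescing, and freeing.

    #include <stddef.h>
    #include <stdalign.h>

    typedef enum {
        CLS_PRIVATE = 0,
        CLS_PRACTICALLY_PRIVATE = 1,
        CLS_SHARED = 2
    } classification_t;

    /* Block header carrying the meta information of Fig. 7:
       size (S), allocation status (A), and the 2-bit class field (C). */
    typedef struct {
        size_t   size;           /* S */
        unsigned allocated : 1;  /* A */
        unsigned cls       : 2;  /* C */
    } block_header_t;

    /* One arena per classification (the aggregation rule of Section 4.1). */
    static alignas(8) unsigned char arena[3][1 << 20];
    static size_t used[3];

    void *malloc_classified(size_t size, classification_t cls) {
        size_t need = (sizeof(block_header_t) + size + 7) & ~(size_t)7;
        if (used[cls] + need > sizeof arena[cls])
            return NULL;                        /* arena exhausted */
        block_header_t *h = (block_header_t *)&arena[cls][used[cls]];
        used[cls] += need;
        h->size = size;
        h->allocated = 1;
        h->cls = (unsigned)cls;
        return (void *)(h + 1);                 /* payload follows the header */
    }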

4.2 Data Classification Aware Caching

This section describes how the data classification can be utilized by designing a CMP architecture with supporting caching policies.

Fig. 8 depicts how the identified data classification information can be utilized in a typical CMP microarchitecture in which each node has a processing core, caches, TLBs, etc. L1 instruction and data caches are local and private to each core. Physically, the L2 is tiled and distributed. Logically, it could be either private to its local node or shared among all nodes, depending on the data classification of the served L2 data block. To support caching both private and shared data, each

Fig. 7. Data blocks maintained by the memory allocator.

Fig. 8. Architecture organization for data classification aware caching.

4. In this paper we use the terms S-NUCA and distributed shared caches interchangeably.


cache block (in both L1 and L2) is augmented with an additional two-bit field, class, indicating its data classification. The class field of each cache block is filled with the classification information from the corresponding TLB entry during the address translation process. Whenever a cache block needs to be searched, placed or written back, the cache controller consults the class field before communicating with the corresponding nodes. This creates an illusion that both private and shared cache blocks are respected in their favored cache organizations, private and shared caches, respectively.

As in a distributed shared cache, an in-cache directory [2] is used to maintain L1 coherence using the MESI protocol. Another on-chip sparse directory is provided to handle L2 coherence for only the practically private data. This saves significant directory entries compared to the traditional private cache organization, in which each datum requires a coherence directory entry.

In particular, the placement and search policy after an L1 miss is described as follows:

• Shared: Shared data is statically distributed throughout all the cache tiles as a function of its physical address. This keeps a unique copy at a fixed location to maximize effective cache capacity, simplify data search and avoid the need to keep coherence at the L2 level. Each shared cache line in L2 is associated with an entry in the in-cache directory to maintain L1 coherence.
• Practically Private: Practically private data is likely to be accessed as private data, thus the local core would retrieve the data from the local cache tile. However, because it is not guaranteed to be private, it may be accessed by other cores. To ensure correctness and promote locality at the same time, we place data of this category within the local cache tile of the requester (e.g., first touch access), and adopt the MESI protocol to maintain coherence, as performed in a traditional private cache organization. For data that is shared by two or more cores this can result in replication and reduced overall cache capacity. However, because the compiler has determined this data is practically private, we expect this type of sharing to be infrequent and not significantly harm overall capacity.

• Private: Private data blocks are typically accessed by only one core, and thus cache coherence actions can be saved for improved performance and efficiency. For example, on an L1 miss of a private access, the local L2 cache bank is directly checked for the requested data. Upon an L2 miss the data is directly obtained from main memory and filled into the cache, as if it were a private L2 scheme without a coherence directory. A side-effect of this approach is that data elements that are privately accessed by different cores can co-exist in a cache line, which can result in false sharing. To allow private data to be cached without coherence overhead, we discuss potential solutions in Section 4.3 to address the false sharing issue. A sketch of the resulting tile-selection decision follows this list.
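
Below is a minimal sketch of that tile-selection decision. The tile count, block size, and function names are illustrative assumptions rather than details taken from the paper.

    #define NUM_TILES  16
    #define BLOCK_BITS 6     /* 64-byte cache blocks assumed */

    typedef enum { PRIVATE, PRACTICALLY_PRIVATE, SHARED } cls_t;

    /* Home tile of a shared block: a fixed, address-interleaved mapping,
       as in S-NUCA. */
    static int home_tile(unsigned long paddr) {
        return (int)((paddr >> BLOCK_BITS) % NUM_TILES);
    }

    /* Which L2 tile the controller probes after an L1 miss, given the
       class field delivered by the TLB. */
    int l2_target_tile(cls_t cls, unsigned long paddr, int requester_tile) {
        switch (cls) {
        case SHARED:
            return home_tile(paddr);  /* unique copy at a fixed tile         */
        case PRACTICALLY_PRIVATE:     /* local tile, kept coherent with MESI */
        case PRIVATE:                 /* local tile; a miss goes straight to
                                         memory, no directory involved       */
            return requester_tile;
        }
        return requester_tile;
    }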

Considering the interaction among cache blocks with distinct classifications, the placement, eviction and coherence behaviors need to be carefully handled so that cache blocks of all types are respected. To illustrate the proposed scheme, we show a number of examples in Fig. 9, where the classification of a cache block or memory request is indicated by its fill color and the MESI state is represented as a letter in parentheses attached to the cache block. We use dashed and solid lines to distinguish eviction (including replacement, write back, and directory notification) and data lookup (including data search, directory lookup, and data reply), respectively. The examples illustrate where the requested data is fetched, omitting MESI state changes, upon a local L1 miss for various data classification scenarios.

Fig. 9(a) illustrates a core issuing a practically private memory request which misses in its local cache (L1 and L2) and fetches the block from the owner's L2 after evicting the shared victim1 in the (M) state and the shared victim2 in the (S) state. To accomplish this, (1) the core chooses victim1 for replacement and (2) because victim1 has the shared classification, writes back to victim1's home, which is determined by victim1's physical address. At step (3), the local L2 is probed for the requested block since the request is practically private. Upon an L2 miss, victim2 is selected for replacement. Because victim2 is shared, the block and all its sharers are evicted at step (4). Note that this eviction is not a necessary component of our protocol but a standard procedure to maintain

Fig. 9. Examples of data flow and coherence protocol for different data classifications: (a) practically private memory request (L1 miss, L2 hit), (b) shared memory request (L1 miss, owner L1 hit), and (c) private memory request (L1 miss, L2 miss).


the inclusion property [29], which simplifies cache coherence and allows the states of L1 sharers to be stored with the corresponding L2 tags.5 After the eviction, the directory at the home node of the originally requested data is searched in step (5) and the message is forwarded to the owner in step (6). We refer to the owner as the node that has the only valid copy of the requested block, in the (E) or (M) state. Any cache can be the owner if the directory indicates the block is in the (S) state. Finally, step (7) returns the requested block to the requester. The coherence states are updated to (S) upon a read, and to (M) and (I) as appropriate upon a write.

Fig. 9(b) explains the situation when a shared memory request replaces a practically private block in the (M) state and then obtains the block from the owner's L1. Steps (1) and (2) show the victim replacement and write back. Since the request is shared, the home node is checked in step (3) and the in-cache directory indicates the owner has the only valid copy in its L1 in step (4). Step (5) returns the block to the requester. If the requested block is (M) or (S) in the in-cache directory, the block can be directly returned to the requester without step (4).

In Fig. 9(c), a private memory request replaces the shared victim1 in the (E) state, determined in step (1), resulting in an in-cache directory update (step (2)). Since the request is private, the block is either in the local L2 or off-chip. A miss in the L2 will trigger an access to main memory. In this example, a practically private victim2 in the (S) state is replaced (step (3)) and the distributed directory at victim2's home is notified to make the necessary changes (e.g., remove victim2 as a sharer) in step (4). Finally, at step (5), the requested line is fetched from main memory and placed in the local L2 bank.

4.3 Addressing False Sharing for Private Data

Multiple private data blocks with different owners6 residing in the same cache block are susceptible to false sharing, as illustrated in Fig. 10 (the C, S and A fields have the same meaning as defined in Section 4.1). False sharing can be a severe problem in certain scenarios, such as when a large number of small private memory blocks are allocated by different threads in a non-contiguous fashion. Thread migration can also result in false sharing, where a private data block is detected as shared because it is accessed by a different core after a migration.

False sharing defeats the data-classification-aware caching policies (see Section 4.2) that eliminate coherence overhead for private memory blocks and, when improperly addressed, can result in incorrect system function.

False sharing problems can be addressed using previously proposed approaches such as utilizing per-word valid bits in cache blocks [30], [31], at the expense of considerable hardware overhead. Alternatively, false sharing can be addressed by enforcing cache block size alignment when allocating private memory blocks. However, this may lead to memory fragmentation and underutilization of memory resources. The memory underutilization and fragmentation issue could potentially be mitigated by an advanced memory allocator. If the memory allocator is aware of the thread allocating a particular memory block (e.g., by keeping an independent memory pool for each thread or annotating each private block with its owner), it can avoid placing private data blocks from different threads into the same cache block.

For all the studied benchmarks, shared and practically private classifications dominate the data accesses. The amount of private data is small (see Section 5). Thus, we employ the simplest solution: aligning data blocks with the cache block size. The wasted space due to the cache-block-aligned memory allocation for private data is not significant. This simple approach also allows us to address the false sharing issue caused by thread migration, since a compiler-detected aligned private data block stays private during the entire migration process, exhibiting a thread-private access pattern on both the source and the destination core.
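
A minimal C sketch of this remedy, assuming a POSIX environment where posix_memalign is available and an illustrative 64-byte cache block:

    #include <stdlib.h>

    #define CACHE_BLOCK 64   /* illustrative cache block size */

    /* Pad and align private allocations to cache-block granularity so
       that private data from two different threads never shares a block. */
    void *malloc_private_aligned(size_t size) {
        void *p = NULL;
        size_t padded = (size + CACHE_BLOCK - 1) & ~(size_t)(CACHE_BLOCK - 1);
        if (posix_memalign(&p, CACHE_BLOCK, padded) != 0)
            return NULL;
        return p;
    }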

5 EVALUATION

In this section, we examine the compiler-classified data access modes and their use in a case study of a classification-aware cache organization in a CMP. We first study the accuracy of our compiler-based data classification approach by comparing its results with classification based on run-time profiling. Second, our case study examines how the classification-aware CMP performs compared with two baseline cache organizations, distributed shared and private, and a state-of-the-art runtime page-level data classification mechanism, R-NUCA7 [7] (see Section 2).

To conduct our experiments, we use Wind River Simics [32] as our simulation environment and implemented the relevant caching schemes within it. The target architecture is a tiled CMP consisting of 16 two-way SPARC processors laid out

Fig. 10. False sharing in a private page.

TABLE 1
Architecture Configurations

5. The amount of evictions is a function of the last level cache capacity, and thus it is only affected by cache block replication, not by our data classification.

6. We refer to the core or thread that exclusively accesses a private data block as the owner of that block.

7. We simulate only the data classification component of R-NUCA, as the code page replication and clustering is orthogonal to data classification and can be applied to many caching schemes including ours.


as a mesh. The network contention due to data and control messages is modeled using a FIFO link reservation mechanism. The detailed architecture parameters are presented in Table 1.

We selected a variety of benchmarks from the SPLASH-2 [11], PARSEC 2 [12] and RODINIA [13] benchmark suites, as listed in Table 2. We study the data classification accuracy and effectiveness for all the benchmarks to demonstrate that the proposed technique is a general approach that handles a wide range of parallel applications with different parallel models (i.e., data-parallel, pipelined and producer-consumer) and frameworks (i.e., Pthreads and OpenMP). For performance impact we focus on the parallel benchmarks implemented using Pthreads. The size of the dataset can bias the results toward a particular type of cache: one that favors private (a small dataset/working set that can easily fit into the cache even with significant replication) and one that favors shared (a larger dataset/working set where cache capacity is at a premium). Unfortunately, the problem size of our simulations was limited due to the intractably long simulation times of large workloads. Thus, to simulate these two conditions, we use two configurations, shared-averse and private-averse, with each configuration/workload enumerated in Table 2.

In the shared-averse configuration, the aggregate L2 cache capacity is configured as 16 MB and the smaller of the two workloads is selected. This favors private caches, as the application can remain relatively cache bound even if a significant amount of data replication occurs. To simulate a private-averse system, we reduced the cache size to 8 MB to make the overall capacity a bigger factor and used the largest possible workload. All other system parameters for the shared-averse and private-averse configurations are the same, as shown in Table 1.

5.1 Impact of the Memory Allocator

The modification to the memory allocator described in Section 4.1 requires very little overhead in terms of code complexity. Essentially, three pools of free space are maintained, and the pool to allocate from is selected based on the additional classification parameter. The major potential overhead is additional memory fragmentation due to the three memory pools. The increase in fragmentation from the modified allocator is shown in Fig. 11. The overall impact is relatively insignificant: a 1.2% average increase in fragmentation. For most applications the increase is practically 0%.
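The three-pool selection can be sketched as follows. This is a hypothetical illustration under our own naming (data_class_t, classified_malloc), with a trivial bump pool standing in for the real per-class free lists:

#include <stddef.h>

typedef enum { DC_PRIVATE, DC_PRACTICALLY_PRIVATE, DC_SHARED } data_class_t;

/* A minimal bump pool; the pools must first be given page-aligned
 * backing storage (initialization elided). */
typedef struct {
    char  *base;  /* start of the pool's backing region */
    size_t used;  /* bytes handed out so far */
    size_t cap;   /* total bytes available */
} pool_t;

static pool_t pools[3];  /* one pool per classification */

void *classified_malloc(size_t size, data_class_t cls)
{
    pool_t *p = &pools[cls];
    if (p->base == NULL || p->used + size > p->cap)
        return NULL;      /* a real allocator would grow the pool */
    void *obj = p->base + p->used;
    p->used += size;      /* same-class objects pack together, so a
                             page never mixes classifications */
    return obj;
}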

However, use of this allocator can have an impact on classification pollution. Thus, to highlight the benefit of the classification scheme and to achieve a fair comparison with runtime schemes, in our subsequent experiments we keep the data classification information generated by the modified memory allocator in a separate table accessible by the runtime system while retaining the original memory allocator's memory layout and allocation.
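For illustration, such a side table can be realized as a sorted array of non-overlapping address ranges recorded at allocation time and searched on lookup; the sketch below uses hypothetical names (class_range_t, classify) and is not the exact structure used in our experiments.

#include <stdint.h>
#include <stddef.h>

typedef enum { DC_PRIVATE, DC_PRACTICALLY_PRIVATE, DC_SHARED } data_class_t;

typedef struct {
    uintptr_t    start, end;  /* [start, end) address range */
    data_class_t cls;
} class_range_t;

/* Sorted, non-overlapping ranges recorded by the allocator. */
static class_range_t table[1024];
static size_t        nranges;

data_class_t classify(uintptr_t addr)
{
    size_t lo = 0, hi = nranges;
    while (lo < hi) {                     /* binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (addr < table[mid].start)      hi = mid;
        else if (addr >= table[mid].end)  lo = mid + 1;
        else return table[mid].cls;
    }
    return DC_SHARED;  /* conservative default for unrecorded data */
}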

5.2 Compiler-Based Data Classification

To demonstrate the effectiveness of the proposed data classification methodology, we show in Fig. 12 the percentage of data accesses in each category during application execution. Except for FFT, which is dominated by shared accesses,

TABLE 2. Benchmarks

Fig. 11. Percentage increase in memory fragmentation due to the classification-aware memory allocator.

Fig. 12. Percentage of accesses classified by the compiler as shared, practically private, and private.


typically a significant fraction of the accesses are practically private. For benchmarks such as BLACKSCHOLES, LUD, NW and HEARTWALL, practically private accesses dominate. On average, more than 60% of the data accesses are practically private. We conclude from these results that practically private data is common in parallel applications and has a potentially large impact on system scalability, efficiency, and performance. We next examine the accuracy of the proposed compiler-based data classification technique.

Figs. 13 and 14 show the actual runtime sharing behavior for the data identified as practically private. Fig. 13 reports the percentages of practically private data blocks with different numbers of sharers. For all the tested benchmarks, more than 50% of the practically private data blocks are actually private, and in most benchmarks private blocks constitute the predominant share of the blocks identified as practically private. On average, over 80% of practically private data blocks are verified to be private. Fig. 14 presents the percentages of data accesses to practically private data blocks with one, two, or three or more sharers. For many benchmarks, the percentage of private accesses within practically private data is not as high as the corresponding percentage of private data blocks. For example, WATER-S exhibits over 70% private data blocks, yet accesses to them comprise only around 40% of the total accesses. This is because a data block with more sharers is likely to be more heavily accessed than a private block. Nevertheless, private accesses still contribute an average of 77% of all accesses to practically private data.

Figs. 15 and 16 report the runtime sharing behavior for the data classified as shared. On average, only 19% of the data blocks classified as shared turn out to be private, while more than 80% of the compiler-classified shared data has two or more sharers. The percentage of accesses supports the classification even further than the percentage of data blocks: Fig. 16 shows that only 5% of the data accesses occur on private blocks, while the remainder occur on shared blocks. Comparing with the runtime sharing behavior of the practically private data, we conclude that the shared data exhibits a predominantly shared access pattern, indicating effective data classification by the compiler.

5.3 Effect on Coherence Traffic

Traditional coherence protocols incur large volumes of coherence messages, especially for heavily accessed data with numerous sharers. This impedes the scaling of future many-core CMPs. Our technique significantly reduces the number of coherence messages since it maintains coherence only for data identified as practically private, which is likely to have few or no sharers. Fig. 17 reports the percentage of coherence traffic eliminated compared with private caches using the MESI coherence protocol. Adopting our classification technique eliminates from 11% to 78% of the coherence traffic; on average, coherence traffic is reduced by 46%.
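In rough terms, the classification selects both placement and coherence handling per block: shared data lives at a fixed S-NUCA home tile with a single on-chip copy and thus needs no coherence, private data is cached at the owner tile without coherence, and only practically private data, which may gain an occasional sharer, is directory-tracked. The sketch below is our simplified reading expressed in C, with hypothetical names (cache_policy_t, policy_for), not a hardware description.

typedef enum { DC_PRIVATE, DC_PRACTICALLY_PRIVATE, DC_SHARED } data_class_t;

typedef struct {
    int place_local;        /* cache at requesting tile vs. S-NUCA home */
    int track_in_directory; /* participate in the coherence protocol   */
} cache_policy_t;

cache_policy_t policy_for(data_class_t cls)
{
    switch (cls) {
    case DC_SHARED:              /* one copy at home tile, no coherence */
        return (cache_policy_t){0, 0};
    case DC_PRIVATE:             /* local copy, owner-only access */
        return (cache_policy_t){1, 0};
    case DC_PRACTICALLY_PRIVATE: /* local copy, tracked for rare sharers */
        return (cache_policy_t){1, 1};
    }
    return (cache_policy_t){0, 1}; /* unreachable; conservative default */
}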

5.4 Performance Evaluation

To evaluate the performance of the proposed caching scheme driven by the compiler-based data classification into private, shared, and practically private (PSP), we compare our approach with distributed shared [15], private [28], and R-NUCA [7] caches in terms of cache miss rate, average memory access latency, and speedup.

5.4.1 Miss Rate

Figs. 18 and 19 show the L2 cache miss rate for the shared-averse and private-averse configurations, respectively, each normalized to the distributed shared cache. In general, distributed shared caches have the lowest miss rate because replication is not allowed and as such the cache capacity is used most effectively. Conversely, in the private cache organization, multiple cache blocks become replicated and consume more capacity, typically resulting in a higher miss rate. R-NUCA has an undesirable miss rate for some benchmarks, especially those exhibiting few raw misses, due to the impact of its page re-classification mechanism on cold-start misses. When a page initially classified as private is re-classified as shared, all the

Fig. 13. Percentages of data blocks classified as practically private that are accessed by one core (private), two cores, or three or more cores.

Fig. 14. Percentages of accesses to the data blocks classified as practically private that are accessed by one core (private), two cores, or three or more cores.

Fig. 15. Percentages of data blocks classified as shared that are accessed by one core (private), two cores, or three or more cores.

Fig. 16. Percentages of accesses to the data blocks classified as shared that are accessed by one core (private), two cores, or three or more cores.


cache blocks within the page that have already been cached must be invalidated, resulting in a higher miss rate even though the total number of misses may be quite low. For longer-running applications with larger numbers of iterations, R-NUCA's re-classification misses can be largely amortized; in such a scenario, access latency is more critical than the re-classification miss rate.

The miss rate of the proposed PSP is typically in the middle and often approaches the better of the two baseline caching schemes, depending on the conditions. In the shared-averse configuration, PSP is better than shared and approaches the quality of private; in the private-averse configuration, PSP is better than private and approaches the quality of shared.

5.4.2 Latency

Average memory access latencies for all relevant schemes are reported in Figs. 20 and 21. Data access latency is affected both by the miss rate and, on a hit, by the distance that must be traversed to retrieve the data from a potentially remote tile. In distributed shared caches, most data is stored in a tile remote from the core that heavily accesses it, resulting in a higher latency, especially in the shared-averse configuration. In contrast, private caching absorbs all the data into the local tile and thus has a lower hit latency because off-tile cache accesses are minimized. This is true especially when the working set size does not exceed the cache capacity, as shown in Fig. 20 for the shared-averse configuration. As the cache capacity is pressured by an increasing working set, the latency becomes dominated by off-chip misses and performance begins to degrade, as demonstrated in Fig. 21.
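The trends in Figs. 20 and 21 follow the standard average-memory-access-time decomposition; the formula below is an illustrative model rather than one used in our evaluation, with $m$ the L2 miss rate, $d$ the network distance to the tile holding the block, and $L_{miss}$ the off-chip miss penalty:

$\mathrm{AMAT} = (1 - m)\,L_{hit}(d) + m\,L_{miss}$

Private caching shrinks $d$, and hence $L_{hit}(d)$, at the cost of a larger $m$ under capacity pressure; distributed shared caching does the opposite. PSP's classification aims to obtain the smaller term for each class of data.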

R-NUCA also partly suffers from high access latency, similar to distributed shared caches, when pages are classified as shared, although it reduces the access latency when pages are initially accessed locally or remain private to a particular processor. Another problem with R-NUCA's data classification is that the page granularity makes it impossible to optimize smaller memory blocks: one byte of shared access in a private page results in the whole page being re-classified as shared. Additionally, R-NUCA is an "all-or-nothing" approach. After a single access by another core, the page is classified as shared even if this is an uncommon or one-time occurrence (e.g., data initialized by the main thread but used by another working thread) for which a private-style scheme would still save considerable data access latency.

Our compiler-assisted caching addresses these problems through customized placement policies for the classified data, packing a page with data of the same classification, and tolerating a stray shared access to practically private data. In the shared-averse configuration, PSP reduces memory latency relative to distributed shared, private, and R-NUCA by , , and , respectively. In the private-averse configuration, the corresponding latency reductions are , , and .

5.4.3 Performance Improvement

From Fig. 22 we can see that in the shared-averse configuration our scheme, PSP, outperforms distributed shared caches by

Fig. 18. Miss rate for shared-averse configuration.

Fig. 17. Percentage of coherence traffic reduced compared to private caches.

Fig. 19. Miss rate for private-averse configuration.

Fig. 20. Average memory access latency for shared-averse configuration.

Fig. 21. Average memory access latency for private-averse configuration.

Fig. 22. Speedup for shared-averse configuration.


while still providing noticeable gains over private ( ) and R-NUCA ( ). These gains come from keeping data local to the core(s) that use it and from the reduction in coherence traffic. R-NUCA's performance suffers because data that PSP classifies as practically private is categorized as shared, requiring longer access latency than under PSP. Fig. 23 indicates that for large working set sizes (the private-averse configuration), PSP outperforms shared, private, and R-NUCA caches by , , and , respectively.

6 CONCLUSION

In this paper we have presented a compiler-assisted data classification, "practically private," that captures important application characteristics while significantly reducing data classification complexity and overhead. To demonstrate the utility of the introduced classification, we have proposed a cache that uses a hybrid of private and S-NUCA caching directed by the compiler's data classification. Compared with other cache mechanisms, in particular R-NUCA in addition to traditional private and shared caches, our cache scheme is unique in that it utilizes the data classification available at compile time, thus eliminating the need for potentially expensive runtime monitoring and profiling. Through the data-classification-aware cache organization we achieve up to 12%, 8%, and 5% performance improvement over shared, private, and R-NUCA caching, respectively. Future directions for this work include expanding our analysis to determine the weight of private accesses within shared or practically private data classifications. We also plan to explore a more aggressive compiler analysis based on data partitioning within data blocks from malloc().

ACKNOWLEDGMENTS

This work is supported by NSF award CCF-1064976. This work was partially supported by NSF award CCF-0702452. A preliminary version of part of this work appeared in the 2012 Proceedings of PACT.

REFERENCES

[1] R. Manikantan, K. Rajan, and R. Govindarajan, "Probabilistic shared cache management (PriSM)," SIGARCH Comput. Archit. News, vol. 40, no. 3, pp. 428–439, Jun. 2012. [Online]. Available: http://doi.acm.org/10.1145/2366231.2337208.

[2] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos, "A tagless coherence directory," in Proc. 42nd Ann. IEEE/ACM Int. Symp. Microarchit. (MICRO 42), ACM, 2009, pp. 423–434.

[3] M. K. Qureshi, "Adaptive spill-receive for robust high-performance caching in CMPs," in Proc. Int. Symp. High Perform. Comput. Archit. (HPCA), 2009, pp. 45–54.

[4] J. Chang and G. S. Sohi, "Cooperative caching for chip multiprocessors," in Proc. Int. Symp. Comput. Archit. (ISCA), 2006, pp. 264–276.

[5] H. Zhao, A. Shriraman, S. Dwarkadas, and V. Srinivasan, "SPATL: Honey, I shrunk the coherence directory," in Proc. Int. Conf. Parallel Archit. Compilation Techn. (PACT'11), 2011, pp. 33–44.

[6] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. Int. Symp. Comput. Archit. (ISCA), 2007, pp. 150–161.

[7] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: Near-optimal block placement and replication in distributed caches," in Proc. 36th Ann. Int. Symp. Comput. Archit. (ISCA'09), ACM, 2009, pp. 184–195.

[8] L. Jin and S. Cho, "SOS: A software oriented distributed shared cache management approach for chip multiprocessors," in Proc. Int. Conf. Parallel Archit. Compilation Techn. (PACT), 2009, pp. 361–371.

[9] B. A. Cuesta, A. Ros, M. E. Gómez, A. Robles, and J. F. Duato, "Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks," in Proc. 38th Ann. Int. Symp. Comput. Archit. (ISCA'11), ACM, 2011, pp. 93–104.

[10] S. Shao, A. K. Jones, and R. Melhem, "Compiler techniques for efficient communications in circuit switched networks for multiprocessor systems," IEEE Trans. Parallel Distrib. Syst. (TPDS), vol. 14, no. 1, pp. 331–345, Mar. 2009.

[11] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in Proc. ACM Symp. Parallel Algorithms Archit., ACM, 1992, pp. 316–322.

[12] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Princeton Univ., Tech. Rep. TR-811-08, Jan. 2008.

[13] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in Proc. IEEE Int. Symp. Workload Characterization (IISWC), Dec. 2010, pp. 1–11. [Online]. Available: http://dx.doi.org/10.1109/IISWC.2010.5650274.

[14] S. W. K. Tjiang and J. L. Hennessy, "Sharlit: A tool for building optimizers," in Proc. ACM SIGPLAN Conf. Program. Language Design Implementation (PLDI'92), ACM, 1992, pp. 82–93.

[15] C. Kim, D. Burger, and S. W. Keckler, "Nonuniform cache architectures for wire-delay dominated on-chip caches," IEEE Micro, vol. 23, no. 6, pp. 99–107, Nov./Dec. 2003.

[16] Y. Li, R. Melhem, and A. Jones, "Leveraging sharing in second level translation-lookaside buffers for chip multiprocessors," IEEE Comput. Archit. Lett., vol. PP, no. 99, p. 1, 2011.

[17] M. Hammoud, S. Cho, and R. G. Melhem, "Cache equalizer: A placement mechanism for chip multiprocessor distributed shared caches," in Proc. 6th Int. Conf. High Perform. Embedded Archit. Compilers (HiPEAC'11), ACM, 2011, pp. 177–186.

[18] Y. Li, A. Abousamra, R. Melhem, and A. K. Jones, "Compiler-assisted data distribution for chip multiprocessors," in Proc. 19th Int. Conf. Parallel Archit. Compilation Techn. (PACT'10), ACM, 2010, pp. 501–512.

[19] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Reading, MA, USA: Addison-Wesley, 2006.

[20] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy, "SUIF: An infrastructure for research on parallelizing and optimizing compilers," SIGPLAN Notices, 1994, pp. 31–37.

[21] R. E. Tarjan, "Fast algorithms for solving path problems," J. ACM, vol. 28, pp. 594–614, Jul. 1981.

[22] A. Abousamra, R. Melhem, and A. K. Jones, "Winning with pinning in NoC," in Proc. IEEE Hot Interconnects, 2009, pp. 13–21.

[23] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing replication, communication, and capacity allocation in CMPs," in Proc. Int. Symp. Comput. Archit. (ISCA), 2005, pp. 357–368.

[24] H. Dybdahl and P. Stenstrom, "An adaptive shared/private NUCA cache partitioning scheme for chip multiprocessors," in Proc. Int. Symp. High Perform. Comput. Archit., 2007, pp. 2–12.

[25] J. Chang and G. S. Sohi, "Cooperative caching for chip multiprocessors," in Proc. 33rd Int. Symp. Comput. Archit., 2006, pp. 264–276.

[26] M. Zhang and K. Asanovic, "Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors," in Proc. 32nd Ann. Int. Symp. Comput. Archit., 2005, pp. 336–345.

[27] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way multithreaded SPARC processor," IEEE Micro, vol. 25, no. 2, pp. 21–29, Mar./Apr. 2005.

[28] J. A. Brown, R. Kumar, and D. M. Tullsen, "Proximity-aware directory-based coherence for multi-core processor architectures," in Proc. ACM Symp. Parallel Algorithms Archit., 2007, pp. 126–134.

Fig. 23. Speedup for private-averse configuration.


[29] J.-L. Baer and W.-H. Wang, "On the inclusion properties for multi-level cache hierarchies," in Proc. 15th Ann. Int. Symp. Comput. Archit. (ISCA'88), IEEE Computer Society Press, 1988, pp. 73–80.

[30] N. P. Jouppi, "Cache write policies and performance," in Proc. 20th Ann. Int. Symp. Comput. Archit. (ISCA'93), ACM, 1993, pp. 191–201. [Online]. Available: http://doi.acm.org/10.1145/165123.165154.

[31] A. Ros and S. Kaxiras, "Complexity-effective multicore coherence," in Proc. 21st Int. Conf. Parallel Archit. Compilation Techn. (PACT'12), ACM, 2012, pp. 241–252. [Online]. Available: http://doi.acm.org/10.1145/2370816.2370853.

[32] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, no. 2, pp. 50–58, Feb. 2002.

Yong Li received the BS degree in telecommunication engineering from Chongqing University, China, in 2005, and the MS degree in computer engineering from the University of Pittsburgh, Pennsylvania, in 2010, and is currently working toward the PhD degree in computer engineering. After completing the BS degree, he worked at Huawei as a telecommunication system engineer. His research interests include high-performance computer architectures, compilers, and on-chip memory systems.

Rami Melhem received the BE degree in electrical engineering from Cairo University, Giza, Egypt, in 1976, the MA degree in mathematics and the MS degree in computer science from the University of Pittsburgh, Pennsylvania, in 1981, and the PhD degree in computer science from the University of Pittsburgh in 1983. He was an assistant professor at Purdue University, West Lafayette, Indiana, prior to joining the faculty of the University of Pittsburgh in 1986, where he is currently a professor of computer science and where he was the chair of the Computer Science Department from 2000 to 2009. His research interests include power management, real-time and fault-tolerant systems, optical networks, high-performance computing, and parallel computer architectures. He served on program committees of numerous conferences and workshops. He was on the editorial boards of the IEEE Transactions on Computers (1991-1996), the IEEE Transactions on Parallel and Distributed Systems (1998-2002), the IEEE Computer Architecture Letters (2001-2010), and the Journal of Parallel and Distributed Computing (2003-2011). He is serving on the advisory boards of the IEEE technical committees on Computer Architecture. He is a member of the ACM.

Alex K. Jones received the BS degree in physics from the College of William and Mary, Virginia, in 1998, and the MS and PhD degrees in electrical and computer engineering from Northwestern University, Evanston, Illinois, in 2000 and 2002, respectively. He is currently the director of computer engineering and an associate professor of electrical and computer engineering and computer science at the University of Pittsburgh, Pennsylvania. He is a Walter P. Murphy Fellow of Northwestern University. His research interests include compilation techniques for configurable systems and architectures, behavioral and low-power synthesis, parallel architectures and networks, radio frequency identification (RFID), and sensor networks for sustainable systems. He is the author of more than 70 publications in these areas, has served on numerous program committees, and has been an AE for several journals in these areas. He is an active leader of ACM SIGDA, the past vice chair and current technical activities chair. He is currently a senior member of the ACM and has received several awards, including the 2010 ACM Distinguished Service Award, a Top 25 paper of the first 20 years of the FCCM Conference, and the Best Paper Award of the 2013 GLSVLSI Symposium, amongst others.

