

CARMA: Contention-aware Auction-based Resource Management in Architecture

Farshid Farhat, Student Member, IEEE, Diman Zad Tootaghaj, Student Member, IEEE

Abstract—As the number of resources on chip multiprocessors (CMPs) increases, the complexity of how best to allocate these resources grows drastically, because the higher number of applications makes the interactions and impacts of the various memory levels more complex. Moreover, selecting the objective function that defines what "best" means for all applications is challenging. Memory-level parallelism (MLP) aware replacement algorithms in CMPs try to maximize overall system performance or to equalize each application's performance degradation due to sharing. However, depending on the selected "performance" metric, these algorithms cannot be implemented efficiently, because such centralized approaches usually need further information about the applications' needs. In this paper, we propose a contention-aware game-theoretic resource management approach (CARMA) that uses a market auction mechanism to find an optimal strategy for each application in a resource competition game. The applications learn through repeated interactions to choose their actions in selecting shared resources. Specifically, we consider two cases: (i) a cache competition game, and (ii) a main processor and co-processor congestion game. We enforce costs for each resource and derive bidding strategies. Evaluation of the proposed approach shows that our distributed allocation is scalable and outperforms static and traditional approaches.

Index Terms—Game Theory, Resource Allocation, Auction.


1 INTRODUCTION

The number of cores on chip multiprocessors (CMPs) is increasing each year, and it is believed that only many-core architectures can handle massively parallel applications. Server-side CMPs usually have more than 16 cores, and potentially hundreds of applications can run on each server. These systems are going to be the future generation of multi-core processor servers. Applications running on them share resources such as the last-level cache (LLC), the interconnection network, memory controllers, off-chip memories, and co-processors, and the higher number of applications makes the interactions and impacts of the various resource levels more complex. Along with the rapid growth of core integration, the performance of applications depends highly on the allocation of resources, and especially on the contention for shared resources [1–11]. In particular, as the number of co-runners on a shared resource increases, the magnitude of performance degradation increases. Moreover, selecting the objective function that defines what "best" means for all applications is challenging; it can even be theoretically impossible to improve the IPC of one application and the memory latency of another simultaneously. As a result, this new architectural paradigm introduces several new challenges in terms of scalability of resource management and assignment on these large-scale servers. Therefore, a scalable competition method that lets applications reach the optimal assignment can significantly improve the performance of co-runners on a shared resource. Figure 1 shows an example of performance degradation for 10 SPEC 2006 applications when

We thank Novella Bartolini and Mohammad Arjomand for their feedback on earlier drafts of this paper. D. Z. Tootaghaj and Farshid Farhat are with the Computer Science Department at the Pennsylvania State University, PA, USA (email: {dxz149, fuf111}@cse.psu.edu). A partial and preliminary version appeared in Proc. IEEE ICCD'17 [1].

Fig. 1: Performance degradation of 10 different SPEC 2006 applications sharing LLC (IPC shared vs. separate, and slowdown, for mcf, bzip2, gcc, stream, bwaves, milc, hmmer, omnetpp, libquantum, and astar).

running on a shared 10MB LLC (Shared), or when running on a private 1MB LLC (Separate).

Among these shared resources, CPU and LLC sharing play an important role in overall CMP utilization and performance. Modern CMPs are moving towards heterogeneous architecture designs in which one can take advantage of either a small number of high-performance CPUs or a higher number of low-performance cores. The advent of Intel Xeon Phi co-processors is an example of such heterogeneous architectures: at run-time the programmer can decide to run any part of the code on a small number of Xeon processors or on a higher number of Xeon Phi co-processors. Therefore, the burden of making decisions about acquiring shared resources is moving towards the applications. In addition to shared CPUs, a shared LLC keeps data on chip and reduces off-chip communication costs [12]. Sometimes an application may flood the cache and occupy a large portion of the available space, hurting the performance of another application that rarely loads from memory but whose accesses are latency-sensitive. Recently, many proposals target partitioning the cache space between applications such that

arXiv:1710.00073v4 [cs.DC] 19 Jul 2018


(1) each application gets the minimum required space, so that per-application performance is guaranteed to be at an acceptable level, and (2) system performance is improved by deciding how the remaining space should be allocated to each one.

Prior schemes [3, 12–17] march towards these two goals, usually by trading off system complexity against maximum system utilization. It has been shown that neither a purely private LLC nor a purely shared LLC can provide optimal performance for different workloads [6]. In general, cache partitioning techniques can be divided into way-partitioning and co-scheduling techniques. In a set-associative cache, partitioning is done by per-way allocation. For example, in a 4-way 512KB shared cache, allocating 128KB to application A means allowing it to store data blocks in only one way per set, without accessing the remaining ways. Co-scheduling techniques try to co-schedule, at the same time, a set of applications with the lowest interference, such that the magnitude of slow-down is the same for each application or a performance metric is optimized for all applications. However, it has been shown that, depending on the objective function for the performance metric, cache allocation can result in totally different allocations [4]. In general, prior schemes have the following three limitations:
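The way-allocation arithmetic above can be sketched as follows (a minimal illustration; the function name is ours, not from the paper):

```python
def ways_for_allocation(cache_kb: int, num_ways: int, alloc_kb: int) -> int:
    """Number of cache ways corresponding to an allocation of alloc_kb.

    In a set-associative cache, each way holds cache_kb / num_ways of data,
    so per-way partitioning can only grant allocations in way-sized units.
    """
    way_kb = cache_kb // num_ways   # size of one way
    return alloc_kb // way_kb       # whole ways granted

# The example from the text: a 4-way 512KB cache has 128KB ways,
# so a 128KB allocation for application A is exactly one way.
print(ways_for_allocation(512, 4, 128))
```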

1. Scalability: All of the prior schemes suffer from scalability issues, especially when the approach tracks the application's dynamism [3, 13, 17], because algorithm complexity becomes higher in dynamic approaches. The root cause of this complexity is that all previous techniques make decisions (cache partitioning, co-scheduling) centrally, using a central hardware or software component. For example, the main algorithm of [13] has exponential complexity O(C(N+K−1, K−1)), where C(·,·) denotes the binomial coefficient, N is the number of applications sharing the LLC, and K is the number of ways. Table 1 shows the state-of-the-art cache partitioning algorithms and their complexity of checking the performance of different permutations.
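To give a sense of scale, the search-space sizes from Table 1 can be evaluated for a modest configuration (a sketch; the values N = 16 applications and K = 4 ways are our assumed example, not from the paper):

```python
from math import comb

N, K = 16, 4  # applications, cache ways (assumed example values)

# Search-space sizes of the schemes listed in Table 1
utility_based = comb(N + K - 1, N - 1)       # C(N+K-1, N-1), [13]
greedy = comb(N, K)                          # C(N, K), [17]
hierarchical = N ** 4                        # N^4, [17]
local_opt = (N // K) ** 2 * comb(2 * K, K)   # (N/K)^2 * C(2K, K), [17]
carma = N * K                                # O(NK) worst case

print(utility_based, greedy, hierarchical, local_opt, carma)
```

Even at this small scale the centralized search spaces run into the thousands of permutations, while CARMA's worst case stays at N·K bids.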

2. Static-based: Most prior works use static co-scheduling to mitigate the slow-down of co-running applications on the same shared cache. However, static approaches cannot capture the dynamic behavior of applications. Figure 2 shows an example of two applications' IPC (hmmer and mcf from SPEC 2006) under different LLC sizes. Consider a case with two cache configurations: a large 1MB cache that can be shared between applications, and two private 512KB caches that are not shared. The two applications compete for the cache space. Suppose both applications have two phases, (0, T) and (T, 2T). If the first application gets the larger cache space, its IPC increases by 35 percent in the first phase and by 20.6 percent in the second phase. The second application's IPC increases by 15 percent in the first phase and by 36.84 percent in the second phase if it gets the larger cache space. A static scheduling approach always allocates the larger LLC to the first application, which has the higher IPC over the whole interval (0, 2T); in CARMA, the applications compete for the shared resources, so the larger LLC goes to the first application in the first phase and to the second application in the second phase. Therefore, static approaches cannot capture the dynamism in applications' behavior and ultimately degrade

TABLE 1: Complexity comparison of state-of-the-art LLC partitioning/co-scheduling algorithms.

Algorithm | Search space
Utility-based main algorithm [13] | C(N+K−1, N−1)
Greedy co-scheduling [17] (N applications, N/K caches) | C(N, K)
Hierarchical perfect matching [17] (N applications) | N^4
Local optimization [17] (N applications, N/K caches) | (N/K)^2 · C(2K, K)
CARMA (N applications, K resources) | O(NK)

the performance significantly.

3. Fairness: Defining a single fairness parameter is challenging for multiple applications, since applications derive different performance benefits from each resource during each phase. In prior works, fairness has been defined through a single metric (e.g., IPC, power, weighted speed-up) for all applications, so the optimization goal of current algorithms is the same for every application. Consequently, applications that desire different metrics cannot be combined on the same platform for a single decision. If one application needs better IPC and another requires lower energy, the previous algorithms cannot model it. The only way to address this diversity of metrics (to be optimized) is an appropriate translation between different metrics (e.g., IPC to power), which is not trivial and has not been addressed in prior studies.

In this paper, we present a game-theoretic resource assignment method to address all the above shortcomings, including scalability, dynamism, and fairness, while applications can get their desired performance based on their utility functions.

1. Semi-Decentralized: The dual of a centralized problem is decentralized if the optimization goal is broken into smaller, meaningful sub-problems. In the context of heterogeneous resource assignment this is straightforward: profiling, analyzing, and evaluating the demands happen on the application side, while the final decision on assigning resources to applications, based on the applications' bids, is easily performed by the OS as the applications compete with each other for the best assignment. As in a capitalist system, the complexity of governing transfers to the independent entities, and the government just makes the policies and the final decisions. To achieve this, we introduce a novel market-based approach. Roughly speaking, the complexity of our approach in the worst-case scenario (for each application) is O(NK), where N is the number of applications and K

[Figure annotation: static-based gives App1 27.8% improvement; CARMA gives App1 35% in the first phase and App2 36.84% in the second, and (36.84/2 + 35/2)% > 27.8%.]

Fig. 2: Performance comparison of static and dynamic scheduling of two applications (hmmer and mcf from SPEC 2006) under two different LLC sizes.


is the number of available resources. However, on average the auction terminates in fewer than N/2 iterations.

2. Dynamic: To confront the scalability problem of previous approaches, we use a market-based approach that moves the decision making to the individual applications. Iterative auctions have been designed to solve non-trivial resource allocation problems at low complexity cost in government sales of resources, eBay, real-estate sales, and the stock market. Similarly, the decentralized computation complexity (for each application) is lower than the centralized one, which provides the opportunity to revisit the allocation decision in a small time quantum, or whenever an application leaves or enters the system.

3. Fair: The proposed method solves the heterogeneous resource assignment problem in the context of a market. Each application's demand, regardless of the global optimization objective (IPC, power, etc.), translates into its true valuation of its own performance. The auctioneer (the OS) assigns resources to the applications with the highest bids, turning the global goal into local optimization objectives. Hence, resource assignment can be performed for different applications with different objectives, known as utility functions.

Overall, for the cache contention game, the proposed approach on average brings a 33.6% improvement in system performance (when running 16 applications) compared to a shared LLC, while coming within 11.1% of the maximum achievable performance of the best dynamic scheme. In the case study of heterogeneous CPU assignment, it brings a 106.6% improvement (when running 16 applications at the same time). Moreover, the performance improvement grows as the number of co-running applications in the system increases.

Other potentials: We introduce an auction-based resource management approach for different applications in large-scale competition games. In short, as a system owner we pay for a high-end CMP server system and guarantee that each application/user gets its best from the system by paying us back; or, as an application owner, we bid/pay the system to get the resources for our best performance. The auctioneer is application-agnostic and does not inspect applications' profiles to globally optimize the system; instead, the applications compete for their own improvement. The two case studies of cache partitioning and CPU sharing are examples of resource sharing, and the proposed approach can be employed in other resource partitioning algorithms.

The remainder of the paper is organized as follows. Section 2 discusses the background and motivation behind this work. In Section 3, we discuss our auction-based game model. Section 4 discusses the case study of the cache contention game, the case study of main processor and co-processor contention, and simulation results. Section 5 studies related work, and Section 6 concludes the paper with a summary.

2 MOTIVATION AND BACKGROUND

2.1 Motivation

Different applications have different resource constraints with respect to CPU, memory, and bandwidth usage. Having a single resource manager for all existing resources and users in the system results in inefficiencies, since it is not scalable and the operating system may not have enough information about applications' needs. For example, the traditional LRU-based cache strategy uses cache utilization as a metric, giving larger cache shares to applications with higher utilization and smaller shares to applications with lower cache utilization. However, higher cache utilization does not always translate into better performance. Streaming applications, for example, have very high cache utilization but very little cache reuse; in fact, they only need a small cache space to buffer the streaming data. With rapid improvements in semiconductor technology, more and more cores are being embedded into a single chip, and managing large-scale applications with a single resource manager becomes more challenging.

In addition, defining a single fairness parameter for multiple applications is non-trivial, since applications have different bottlenecks and may derive different performance benefits from each resource during each phase of their execution. For instance, simple assignment algorithms that try to distribute resources equally among all applications ignore the fact that different applications have different resource constraints. As a consequence, centralized resource management systems are very inefficient in terms of both fairness and applications' performance needs. We need a decentralized framework in which every application's performance benefit can be translated into a single notion of fairness and performance objective (known as a utility function in economics), and the algorithm allocates resources based on this translated notion of fairness. This translation is well defined in economics and marketing, where the diversity of customer needs makes a market more economically efficient [20]. Economists have shown that in an economically efficient market, having diverse resource constraints and letting customers compete for resources can lead to a Nash equilibrium in which both the applications and the resource managers are enriched. Furthermore, applications' demands change over time. Most resource allocation schemes pre-allocate resources without considering the dynamism in applications' needs and in the number of users sharing the same resource over time; therefore, applications' performance can degrade drastically. Figure 3 shows phase transitions in instructions per cycle (IPC) of the mcf application from SPEC 2006 over 50 billion instructions.

We seek a game-theoretic distributed resource management approach in which the shared hardware resources are exposed to the applications, and we show that by running a repeated auction game between different applications, which are assumed to be rational, the output of the game converges to a balanced Nash equilibrium allocation. In addition, we examine the convergence time of the proposed algorithm with respect to the dynamism in the system. We evaluate our model with two case studies: 1) the private versus shared last-level cache problem, where each application must decide whether it would benefit more from a larger cache space that can potentially get more congested, or a smaller cache space that is potentially less congested; and 2) the heterogeneous processors (Intel Xeon and Xeon Phi) problem, where we


Fig. 3: Phase transition in mcf with different L2 cache sizes (IPC over instructions; measured IPC at 2MB, with curve fittings for 2MB and 128kB).

perform experiments to show how congestion affects the performance of different applications running on Intel Xeon processors or Xeon Phi co-processors. Depending on the amount of congestion in the system, an application may or may not offload the most time-consuming part of its code to the Xeon Phi co-processors.

2.2 Background

Game theory has been used extensively in economics and in political and social decision making [21–28]. A game is a situation where the output of each player depends not only on her own action but also on the actions of the other players [29]. Auction games are a class of games that has been used to formulate real-world problems of assigning different resources among n users. The auction game framework can model resource competition, where the payoff (cost) of each application in the system is a function of the contention level (number of applications) in the game.

Inspired by market-based interactions in real-life games, there exists a repeated interaction between competitors in a resource sharing game. Assuming a large number of applications, we show that the system reaches a Nash equilibrium where all applications are satisfied with their resource assignment and do not want to change their state. Furthermore, we show that the auction model is strategy-proof: no application can gain more utility by bidding more or less than its true valuation of the resource. In this paper we propose a distributed market-based approach that enforces a cost on each resource in the system and removes the complexity of resource assignment from the central decision maker.

Traditionally, resource assignment is performed by the operating system or by central hardware, which assigns a fair amount of resources to the different applications. However, fair scheduling is not always optimal, and the optimization problem of assigning m resources among n users is an integer program, which is NP-hard, so finding the best assignment becomes computationally infeasible. Prior works focus on designing a fair scheduling function that maximizes every application's benefit [30–34], even though applications may have completely different demands, which makes a single fairness function unsuitable for all of them. By shifting decision making to the individual applications, the system becomes scalable and the burden of establishing fairness is removed from the centralized decision maker, since individual applications have to compete for the resources they need. Applications start by profiling the utility function for each resource and bid for the most profitable resource. During execution, they can update their beliefs based on the performance metrics observed at each round of the auction. Updating the utility functions at each round is based on the history of observed performance metrics, which reflects the state of the game; this state indicates the contention on the currently acquired resources. The payoff function in each round depends on the state of the system and on the actions of the other applications.

2.2.1 Sequential Auction

Auction-based algorithms are used for maximum-weight perfect matching in a bipartite graph G = (U, V, E) [35–37]. A vertex Ui ∈ U is an application in the auction, and a vertex Vj ∈ V is interpreted as a resource. The weight of the edge from Ui to Vj is the utility that Ui obtains from getting that particular resource. The prices are initially set to zero and are updated during each iteration of the auction. In a sequential auction, each resource is taken by the auctioneer and sequentially auctioned to the applications until all resources are sold.

2.2.2 Parallel Auction

In a parallel auction, each application submits a bid for its most profitable item. The value of the bid at each iteration is computed as the difference between the profits of the most profitable object and the second most profitable object. The auctioneer assigns the resources based on the current bids. At each iteration, the valuation of each resource is updated based on the information observed during run-time, which reflects the contention on that particular resource.
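The bidding rule described above (bid = gap between the best and second-best net profit) is the classic auction algorithm for assignment problems; a minimal Python sketch, with names of our own choosing, might look like this:

```python
def auction_assignment(value, eps=0.01):
    """Auction algorithm for one-to-one assignment (n bidders, n objects).

    value[i][j] is bidder i's valuation of object j. Each unassigned
    bidder bids on its most profitable object; the bid equals the gap
    between the best and second-best net profit, plus eps. With eps > 0
    the auction terminates with a near-optimal assignment.
    """
    n = len(value)                      # assumes n >= 2 bidders/objects
    prices = [0.0] * n                  # current price of each object
    owner = [None] * n                  # owner[j]: bidder holding object j
    assigned = [None] * n               # assigned[i]: object held by bidder i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        net = [value[i][j] - prices[j] for j in range(n)]  # profit at prices
        best = max(range(n), key=net.__getitem__)
        top, second = sorted(net, reverse=True)[:2]
        prices[best] += top - second + eps                 # raise the price
        if owner[best] is not None:                        # evict old owner
            assigned[owner[best]] = None
            unassigned.append(owner[best])
        owner[best] = i
        assigned[i] = best
    return assigned, prices
```

For instance, with valuations [[10, 5], [6, 8]] the auction gives object 0 to bidder 0 and object 1 to bidder 1, matching the maximum-weight matching.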

3 THE METHOD

Consider n applications and i instances of m different resources. Applications arrive in the system one at a time, and each application has to choose among the m resources. The matching between the applications and the resources forms a bipartite graph.

In general, more than one application can get a shared resource. However, an application cannot hold more than one of the available heterogeneous resources. For example, if we have two cache spaces of size 128kB (one way) and 256kB (two ways), each application can get either the 128kB or the 256kB cache space, but not both at the same time. Furthermore, each resource Rk has a cost pk, which is determined by the applications' bids in the auction.

Figure 4 shows our auction-based framework supporting CARMA among N applications that execute together, competing for M different resources. Each application has a utility table that shows how much performance it gets from each of the M resources at each time slot. Based on their utility tables, applications submit bids for the most profitable resource. Based on the submitted bids, the auctioneer decides the resource assignment for each resource,


TABLE 2: Notation used in our formulations.

N | Number of players or applications (from App1 to AppN)
M | Number of resources (from R1 to RM)
~m | A positive M × 1 vector in the resource space that shows how much each application gets from each resource
T | Time interval at which the bidding is held
t_{i,j} | j-th phase time of the i-th application during its execution
T_i | Last phase time of application i
v_i(t, ~m) | Valuation function of application i for the resource assignment ~m at time t
v_{i,j}(t, ~m, r) | Valuation function of (application i, resource j) if the j-th resource in the resource vector ~m is replaced by r
δ | Discount factor that shows how much we can rely on past iterations
G = (U, V, E) | Bipartite graph showing the resource allocation between the applications and the set of resources
U | Set of applications; the left set of nodes in G = (U, V, E)
V | Set of resources; the right set of nodes in G = (U, V, E)
E | Edges of the bipartite graph
b_{i,k} | User i's bid for the k-th resource
F_i | Total budget (sum of bids) a user has
C_k | Total capacity of resource k
p_k | Price of resource k ∈ V in the auction
K | Number of cache levels

and updates the prices. Next, the applications that did not get any assignment compete repeatedly for the next most profitable resource, based on the updated prices, until all applications are assigned. Figure 4 shows an example of a resource assignment and the corresponding bipartite graph.

3.1 Problem Definition

We formulate our problem as an auction to enforce cost/value updates for each resource, as follows:

• Valuation v_i(t, ~m): Application i has a valuation function that shows how much it benefits from the resource vector ~m at time t. For the cache contention case study, the valuation function at time t = 0 is derived from the IPC (instructions per cycle) curves obtained by profiling; for the processor and co-processor contention case study, it is derived from profiling the application's performance with a separate cache. In general, however, each application can choose its own utility function.

• Observed information: The observed information at each time step is the performance value of the selected action in the game. Therefore, the applications repeatedly update the history of their valuation functions over time.

• Belief updating: Let T be the time interval at which the bidding is held. At each iteration step of the auction, the applications update their valuation of each resource based on the observed performance on the resource vector. The update at time W is derived using the following formula:

v_i(W, ~m) = [ Σ_{0≤n≤W/T} δ^{(W/T)−n} · v_i(nT, ~m) ] / [ Σ_{0≤n≤W/T} δ^{(W/T)−n} ],  (1)

where v_i(W, ~m) is the observed valuation of resource vector ~m at time W by application i, and δ is the discount factor between 0 and 1 that captures how much a user relies on its past observations. The discount factor is chosen to reflect the dynamics of the system: if the observed information changes fast, the discount factor is close to zero, i.e., the application cannot rely much on past observations; if the system is more stable and the observed information changes slowly, the discount factor is closer to 1. If a user fails in an auction, its payoff and corresponding observed valuation at the current time are zero, so it will likely not bid for the same resource vector again, since its valuation decreases for the next round. We choose the discount factor to be the absolute value of the correlation coefficient of the observed valuations at each iteration step, calculated as follows:

δ = (E(v_i(W, ~m)))^2 / σ^2_{v_i(W, ~m)}.  (2)

• Action: At each time step, the applications decide which resource to bid on and how much to bid for it.

Table 2 shows the important notation used throughout the paper. In the following sections, we describe our distributed optimization scheme to solve the problem.
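As a concrete illustration, the belief update of Eq. 1 can be sketched in a few lines of Python; the function name and data layout below are our own assumptions, not part of the paper's implementation:

```python
def discounted_valuation(observations, delta):
    """Discounted average of observed valuations, as in Eq. 1.

    observations: v_i(nT, ~m) samples, one per bidding interval T,
                  oldest first (n = 0 .. W/T).
    delta: discount factor in (0, 1]; a small delta trusts only
           recent samples, delta = 1 weighs all samples equally.
    """
    n_max = len(observations) - 1  # n_max corresponds to W/T
    weights = [delta ** (n_max - n) for n in range(len(observations))]
    total = sum(w * v for w, v in zip(weights, observations))
    return total / sum(weights)
```

With delta = 1 the update reduces to a plain average of all observations; as delta approaches 0 it converges to the most recent observation, matching the dynamics described above.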

3.2 Distributed Optimization Scheme

The goal is to design a repeated auction mechanism, run by the operating system, that guides the applications to choose their best resource allocation strategies. The applications' goal is to maximize their own performance, while the operating system wants to maximize the total utility gained from the applications. Each application can use its own utility function and evaluates the resources based on their desired value.

Applications' approach: Application i wants to maximize its expected utility (payoff) with respect to a limited budget (F_i) during all phases of its execution time. We have:

∀i ∈ U:  maximize  Σ_{0<t<T_i} [ v_i(t, ~m) − b_i(t, ~m) ],
         subject to  Σ_{0<t<T_i} b_i(t, ~m) ≤ F_i,  ∀i ∈ U.  (3)

OS's approach: The operating system wants to maximize the social welfare function, which is translated into the submitted bids from the applications under limited resource constraints:

maximize  Σ_{i=1}^{N} Σ_{0<t<T_i} b_i(t, ~m) · A_i(t, ~m),
subject to  Σ_{i=1}^{N} A_i(t, ~m) ≤ A_max,  ∀t : 0 ≤ t ≤ T,
            A_i(t, ~m) ∈ {0, 1},  ∀i ∈ U, ∀~m ⊆ V, ∀t : 0 ≤ t ≤ T,  (4)


Fig. 4: Framework for auction-based resource assignment (CARMA): applications App1 through AppN submit valuations v_i(t, ~m) and bids to the auctioneer, which returns allocations over resources R1 through RM and updates the prices.

where the binary variable A_i(t, ~m) represents the decision to assign resource vector ~m to application i at time t (A_i(t, ~m) = 1) or not (A_i(t, ~m) = 0); V is the vector space of all resource vectors (∀~m); and A_max is the maximum number of applications that can share the resource vector ~m.
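On the OS side, the per-resource winner selection implied by Eq. 4 (and made explicit in steps 5 and 6 of Algorithm 1 below) can be sketched as follows; the function name and return convention are our assumptions:

```python
def assign_resource(bids, capacity):
    """Assign one shared resource to its highest bidders.

    bids: mapping app_id -> partial bid submitted for this resource.
    capacity: number of applications L that may share the resource.
    Returns (winners, p_j, bmin_j): the L highest bidders, the updated
    price p_j (highest winning bid), and the minimum winning bid Bmin_j.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winning = ranked[:capacity]
    winners = [app for app, _ in winning]
    amounts = [bid for _, bid in winning]
    return winners, max(amounts), min(amounts)
```

For example, with bids {0.2, 0.3, 0.4} and capacity 1, the 0.4 bidder wins and the price becomes 0.4; any later challenger must outbid the returned minimum winning bid.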

Illustrative example: As shown in Figure 2, consider a case with two different resources: a large 1MB cache that can be shared between applications, and two private 512KB caches that are not shared. The first application participates in the auctions with a bid of 35 in the first phase and 21 in the next phase; the second application bids 15 in the first phase and 37 in the next phase. The auctioneer (OS) therefore allocates the larger cache to the first application in the first phase and to the second application in the next phase.

3.3 Analysis

The distributed optimization problem is hard to solve. In practice, however, it can be split into simpler subproblems, since each application knows its bottleneck resource and will first bid for that resource to maximize its utility.

We suppose all applications in the system are risk-neutral, meaning they have a linear valuation of the utility function; each risk-neutral agent wants to maximize its expected revenue. Risk attitudes are defined in [38], where agents are broadly divided into risk-averse, risk-seeking, and risk-neutral. Risk-averse agents prefer deterministic profits over risky ones, while risk-seeking applications have a super-linear utility function and prefer risky utilities to sure ones. Next, we derive the Bayes Nash equilibrium strategy profile for all agents in the system assuming risk neutrality.

Definition 1. A strategy profile a is a pure Nash equilibrium if for every application i and every strategy a′_i ≠ a_i ∈ A we have u_i(a_i, a_−i) ≥ u_i(a′_i, a_−i).

Theorem 1. Suppose n risk-neutral applications, whose valuations are drawn uniformly and independently from the interval [0, 1], compete for one resource that can be assigned to the m applications with the highest bids in the auction. The Bayes Nash equilibrium bidding strategy for each application in the system is to bid ((n−m)/(n−m+1)) · v_i, where v_i is the profit of application i from getting the specified resource.

Theorem 1 states that whenever users with different valuation functions compete for a single resource, the Nash equilibrium strategy profile for risk-neutral users is to bid ((n−m)/(n−m+1)) · v_i. This bid tends to the true value of the object as n grows large.

In the case of competition for more than one resource, we derive Algorithm 1 for heterogeneous resource assignment and prove that the game has a Nash equilibrium. Algorithm 1 uses Algorithm 2 for budget planning. In the first step, all valuations are set to the application's solo-run performance. Next, each application submits a partial bid for its first bottleneck resource. The partial bid must be larger than the price of the object, which is initialized to zero at the beginning of the program. An application only has an incentive to bid a value no larger than the difference between its first and second bottleneck resources; otherwise, it would submit a smaller bid for the second bottleneck and get the same revenue as paying more for the first bottleneck resource. To break ties between applications with equal valuations, we use ε-scaling, so that at each iteration of the auction the prices increase by a small amount ε. The OS sets the resources' prices with these partial bids and finds the minimum of the highest partial bids for each resource. The applications recurse over all the resources, and the total bid is the sum of the partial bids of each application. Then, the applications execute Algorithm 2 to decide whether to participate in the auction. Finally, the participating applications with

Algorithm 1: Parallel Auction for Heterogeneous Resource Assignment

Input: A bipartite graph G = (U, V, E).
Output: The allocation of resources to applications.

1 The initial resource vector for each application is the average amount across the various resources. At time t = nT, the valuation of each application for each resource vector is updated using Eq. 1.
2 For application U_i ∈ U, the first bottleneck resource is
    V_{i,jmax1} = max_{1≤j1≤M} [ ∆v_{i,j1}(t, ~m, r_{j1}) − p_{j1} ],
  where the differential valuation function is ∆v_{i,j}(t, ~m, r_j) = v_{i,j}(t, ~m, r_j) − v_i(t, ~m).
3 Find the second bottleneck resource for application U_i ∈ U in the system:
    V_{i,jmax2} = max_{1≤j2≤M; j2≠j1} [ ∆v_{i,j2}(t, ~m, r_{j2}) − p_{j2} ].
4 Each application calculates the partial bid for its first bottleneck resource using the following formula:
    b_{i,jmax1}(t) = V_{i,jmax1} − V_{i,jmax2} + p_{jmax1} + ε.
5 Each resource r_j ∈ V, which can be shared among L applications, is assigned to the L highest-bidding applications Winner_{i,j} = {i1, i2, ..., iL}, and the price of that resource is updated as follows:
    p_j = max_{l∈{1,...,L}} b_{il,j}.
6 The minBid for each resource is updated as the minimum bid of the L applications that acquired the j-th resource. That is:
    Bmin_j = min_{il∈Winner_{i,j}} b_{il,j}.
7 Go to step 2 until all partial bids for all resources are determined.
8 The total bid of application i is as follows:
    b_i(t, ~m) = b_i(t, [r_{j1}; r_{j2}; ...; r_{jM}]) = Σ_{j=1}^{M} b_{i,j}(t),
  where iteratively ~m = [r_{j1}; r_{j2}; ...; r_{jM}].
9 Find the estimated investment I_i(t) using Algorithm 2 to plan the upper bound of the investment with respect to the budget F_i. If I_i(t) ≥ b_i(t, ~m), application i will participate in this auction at time t; otherwise it quits, and the other applications repeat steps 2 and 3.

the bids higher than Bmin_j will get the j-th resource.

The overhead of the auction for the auctioneer (the OS) is negligible: during the auction, the OS only sets the prices of the resources based on the received bids and gives the resources to the highest bidders. So every T seconds the OS runs these two jobs, which adds negligible overhead relative to the other tasks of the OS. Our approach also satisfies the following properties. 1) Individual rationality (IR): applications' expected utility is non-negative, because the amount of a bid cannot exceed the sum of the differences of the valuations, which is at most the application's highest valuation. 2) Truthfulness: applications cannot benefit from bidding other than their true valuation; by contradiction, if an application bids lower than its true value, another application with a higher bid may take the resource. However, we cannot guarantee truthfulness in the case of collusion among applications. 3) Budget balance: the total payments from the applications do not exceed the OS's revenue, which is trivial since the OS is the only seller. 4) Economic efficiency: it has been shown in [35] that this assignment is optimal, but that does not mean it is economically efficient, since it depends on the applications' valuations, which are sub-optimal.

Algorithm 2: Budget Planning

Input: A bipartite graph G = (U, V, E).
Output: Participation (YES) in an auction or Quit (NO).

1 At time t = nT, assume that we have the same state in terms of resources.
2 For application U_i ∈ U, similarly run steps 2, 3, 4, 5, and 6 of Algorithm 1 to find all estimated bids in the next rounds based on its various phases. We have:
    b_i(t_{i,j}, ~m) = Σ_{j=1}^{M} b_{i,j}(t),  ∀t_{i,j} > t.
  We also have the previous bids of application i: b_i(t_{i,j}, ~m), ∀t_{i,j} < t.
3 If F_i ≥ Σ_{∀t_{i,j}≠t} b_i(t_{i,j}, ~m), then YES and the application will participate in the auction. Otherwise NO, and the application will update a zero valuation for the current round using Eq. 1.
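Steps 2-4 of Algorithm 1 (computing an application's partial bid) and the budget test of Algorithm 2 can be sketched as follows; the function names and data layout are our assumptions, not the authors' implementation:

```python
def partial_bid(diff_valuations, prices, eps=1e-3):
    """Steps 2-4 of Algorithm 1 for one application.

    diff_valuations: differential valuations dv_{i,j}(t, ~m, r_j),
                     one per resource j.
    prices: current auction prices p_j.
    Returns (index of first bottleneck resource, partial bid for it).
    """
    margins = [dv - p for dv, p in zip(diff_valuations, prices)]
    j1 = max(range(len(margins)), key=lambda j: margins[j])   # step 2
    v2 = max(m for j, m in enumerate(margins) if j != j1)     # step 3
    return j1, margins[j1] - v2 + prices[j1] + eps            # step 4

def should_participate(budget, other_round_bids):
    """Algorithm 2: participate only if the budget F_i covers every
    bid outside the current round (past and estimated future bids)."""
    return budget >= sum(other_round_bids)
```

With differential valuations [1.9, 1.7, 1.5] and zero prices, the application targets resource 0 and bids the 0.2 gap between its two best options (plus ε), which is exactly the bidding pattern of the cache example in Section 4.2.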

4 CASE STUDIES

4.1 CPU Scale-up Scale-out Game

Emerging high-performance computing applications led to the advent of the Intel Xeon Phi co-processor, which, when its highly parallel architecture is fully utilized, can deliver an order of magnitude higher performance than existing processor architectures. The Xeon Phi co-processors are the first commercial product of the Intel MIC architecture, in which the hardware is exposed to the programmer, who chooses whether the code runs on the Xeon processor or the Xeon Phi co-processor. During the course of execution, either the processor or the co-processor may become congested, degrading the application's performance considerably. Therefore, the decision to offload the most time-consuming part of the program to the Xeon or the Xeon Phi should be made online, based on the contention level. In this section, we look at the case study of applying our auction-based model to the decision of running the application on the main processor or the co-processor in a highly congested environment.

The experimental results in this section were obtained on the Stampede cluster of the Texas Advanced Computing Center. Table 3


TABLE 3: Comparison of the Intel Xeon processor and the Intel Xeon Phi co-processor.

Processors: Xeon E5-2680 | Xeon Phi SE10P
Cores/Sockets: 8/2 | 61/1
Clock Frequency: 2.7 GHz | 1.1 GHz
Memory: 32GB, 8x4G 4-channel DDR3-1600MHz | 8GB GDDR5
L1 cache: 32 KB | 32 KB
L2 cache: 256 KB | 512 KB
L3 cache: 20 MB | -

shows the comparison of the Intel Xeon and Xeon Phi architectures used in this section. We observed that congestion has a significant impact on the performance of running the application on Xeon and Xeon Phi machines. Since most cloud computing machines are shared among thousands of users, the programmer should not only exploit parallelism by offloading the most time-consuming part of the code to the larger number of low-performance cores (Xeon Phi), but also consider the congestion level (number of co-runners) in the system. To this end, we performed experiments on Stampede clusters. We executed the MiniGhost application, part of the Mantevo project [39], which uses difference stencils to solve partial differential equations numerically. The applications use the profiled utility functions at t = 0 and, during the course of execution, update the utility function based on the observed performance on each core using Equation 1. They can then revisit their previous decision to run the code on either the processor or the co-processor at run-time.

Figure 5 shows the total execution time with respect to the congestion we created on the Xeon and Xeon Phi. In this experiment we ran the same problem size on a Xeon and a Xeon Phi machine multiple times so that we could see the effect of load on the total execution time of our application. We observed that, with the same number of threads, the Xeon's performance degrades more than the Xeon Phi's. Next, we used the congestion-aware game-theoretic algorithm to decide where to offload the most time-consuming part of the application based on the observed performance behavior. Figure 6 shows the result of our game-theoretic model during the execution time. During the course of execution, the applications change their strategy of choosing either the main processor or the co-processor, and all applications' performance converges to an equilibrium point where no application wants to change its strategy.

Furthermore, CARMA achieves up to a 106.6% improvement in the total execution time of applications compared to the static approach when the number of co-runners is six, and the improvement grows as the number of co-runners increases. Figure 7 shows the performance comparison of CARMA and the static approach, which does not consider congestion dynamics in the system and makes its decision based only on the parallelism level in the code.

Fig. 5: Congestion effect on Xeon and Xeon Phi machines. (Execution time in seconds vs. load in number of tasks.)

Fig. 6: Performance of 6 instances of applications during the execution time for our proposed game model. (Execution time in seconds vs. time steps, for App1 through App6.)

Fig. 7: Performance comparison of congestion-aware schedule versus static schedule. (Execution time in seconds vs. number of applications, for Static and CARMA.)

4.2 A Case Study of the Private and Shared Cache Game

One of the challenging problems in CMP resource management systems is whether applications benefit more from a large shared last-level cache or an isolated private cache. We evaluated CARMA's performance on the 10MB LLC shown in Figure 8, where the 2MB, 1MB, 512kB, 256kB, and 128kB levels of the LLC can potentially be shared among 16, 8, 4, 2, and 1 applications, respectively; these cache levels have 16, 8, 4, 2, and 1 ways. Table 5 summarizes the studied workloads and their characteristics, including misses per kilo-instruction (MPKI), memory bandwidth usage, and IPC. We use applications from the SPEC 2006 benchmark suite [40] and the gem5 full-system simulator [41, 42]. Table 4 shows the experimental setup.

To evaluate the performance of our proposed approach, we use the utility functions for different numbers of cache ways shown in Figure 9. These utility functions at the start of execution can be found using either profiling techniques or the stack distance profile [5, 43, 44] of the applications, assuming there are no co-runners in the system. During run-time, the applications can then update their utility functions


Fig. 8: The proposed last level cache hierarchy model: sixteen 128K slices, eight 256k, four 512k, two 1M, and one 2M.

TABLE 4: Experimental Setup.

Processors: single-threaded, with private L1 instruction and data caches
Frequency: 1 GHz
L1 Private ICache: 32 kB, 64-byte lines, 4-way associative
L1 Private DCache: 32 kB, 64-byte lines, 4-way associative
L2 Shared Cache: 128 kB-2 MB, 64-byte lines, 16-way associative
RAM: 12 GB

TABLE 5: Evaluated workloads.

#  | Benchmark  | MPKI   | Memory BW  | IPC
1  | astar      | 1.319  | 373 MB/s   | 2.057
2  | bwaves     | 10.47  | 1715 MB/s  | 0.661
3  | bzip2      | 3.557  | 1194 MB/s  | 1.367
4  | dealII     | 0.935  | 307 MB/s   | 2.107
5  | GemsFDTD   | 0.004  | 2.19 MB/s  | 2.023
6  | hmmer      | 2.113  | 1547 MB/s  | 2.861
7  | lbm        | 19.287 | 3954 MB/s  | 0.533
8  | leslie3d   | 8.469  | 1942 MB/s  | 1.297
9  | libquantum | 10.388 | 1589 MB/s  | 0.531
10 | mcf        | 16.93  | 820 MB/s   | 0.073
11 | namd       | 0.051  | 20.32 MB/s | 2.362
12 | omnetpp    | 10.34  | 1147 MB/s  | 0.504
13 | sjeng      | 0.375  | 139.2 MB/s | 1.403
14 | soplex     | 4.672  | 390.8 MB/s | 0.513
15 | sphinx3    | 0.349  | 202.8 MB/s | 2.223
16 | streamL    | 31.682 | 3619 MB/s  | 0.581
17 | tonto      | 0.260  | 107 MB/s   | 2.036
18 | xalancbmk  | 12.703 | 1200 MB/s  | 0.558

based on Equation 1. Therefore, there is a learning phase in which applications learn about the state of the system and update their utilities accordingly. The stack distance profile indicates how many additional cache misses occur if the application has fewer ways in the cache. Based on the stack distance profile, the applications can update their utility functions and bid in the next iteration of the auction if they want to change their allocation. Next, we give an example of the auction for one time step of the game. This time step can be repeated when an application arrives at or leaves the system, or when an application's phase changes during run-time. In those cases, the algorithm reaches the optimal assignment in far fewer iterations, since all other assignments are fixed and only a few applications are affected.

Example: As an example, suppose we have 5 different applications and 5 different cache levels with capacities of 128KB, 256KB, 512KB, 1MB, and 2MB. In addition, suppose the 128kB cache level can accommodate no more than one application, the 256kB cache can accommodate 2 applications, the 512kB level 4 applications, the 1MB cache 8 applications, and the 2MB cache at most 16 applications. Let the following matrix be the utility function of each application at each cache level.

Some applications may get better utility from smaller cache space, since smaller levels are less congested; and because these applications have low data locality, moving them to larger cache spaces not only fails to increase their performance but actually degrades it, by evicting other applications from the cache and creating contention on the memory bandwidth, which is a more vital resource for them.1

M =
        1way  2way  4way  8way  16way
App1    1.9   1.7   1.5   1.0   0.9
App2    1.6   1.3   1.1   0.8   0.7
App3    1.4   1.0   0.6   0.5   0.4
App4    0.3   0.6   0.9   1.2   1.4
App5    0.7   0.8   1.1   1.4   1.7      (5)

In the first iteration of bidding, the first three applications bid for their most profitable resource, the 128kB cache, and each submits a bid equal to the difference in profit between its first and second most profitable resources. Thus the first application submits a bid of 0.2 for the 128kB cache, the second submits 0.3, and the third submits 0.4. Since only one player can acquire the 128kB cache space, the third application, as the highest bidder, gets it. The 4th and 5th applications compete for the 2MB cache space and both get it, with the resource's price set to the sum of their bids, 0.5. In the next round the prices are updated, and applications 1 and 2, which have no cache assignment yet, compete for the 256kB cache space; each bids 0.2, the differences between 1.7 and 1.5 and between 1.3 and 1.1 in the utility matrix, respectively. Since the second cache level can accommodate both applications, the price is updated, and the minimum bidding price to acquire this cache level becomes the minimum of their bids, 0.2. Therefore, if another application bids more than 0.2, it can acquire the resource, and the application with the smallest bid must resubmit its bid to keep the resource. Figures 10a, 10b, and 10c show the bidding steps, the prices, and the minimum bidding prices. As seen in the figures, the auction terminates in three iterations for five applications.
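The first round of this example can be replayed directly from matrix (5); the sketch below is our illustration of the bidding rule, not the authors' code:

```python
# Utility of each application (rows) at each cache level (columns:
# 1, 2, 4, 8, 16 ways), copied from matrix (5).
M = [
    [1.9, 1.7, 1.5, 1.0, 0.9],  # App1
    [1.6, 1.3, 1.1, 0.8, 0.7],  # App2
    [1.4, 1.0, 0.6, 0.5, 0.4],  # App3
    [0.3, 0.6, 0.9, 1.2, 1.4],  # App4
    [0.7, 0.8, 1.1, 1.4, 1.7],  # App5
]

def first_round_bid(utilities):
    """Bid for the best level: gap between best and second-best utility."""
    ranked = sorted(utilities, reverse=True)
    best_level = utilities.index(ranked[0])
    return best_level, round(ranked[0] - ranked[1], 6)

bids = [first_round_bid(row) for row in M]
# Apps 1-3 target level 0 (128kB) with bids 0.2, 0.3, and 0.4, while
# Apps 4-5 target level 4 (2MB) with bids 0.2 and 0.3, as in the text.
```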

4.3 A Case Study for the Hybrid Cache Game

In the hybrid cache game, each cache partition can host a cluster of applications. We use different mixes of 4 to 16 applications from SPEC 2006 to evaluate the performance of our proposed approach compared to others. As a competitor, we selected the state-of-the-art centralized cache partitioner KPart [? ], which aims to maximize the global IPC speedup. CARMA uses multi-resource valuations, so each application can use any criterion to maximize its payoff. In order to provide a fair comparison with our approach, we use IPC speedup as the

1. libquantum, streamL, sphinx3, lbm, and mcf are examples of such applications.


Fig. 9: IPC for different sizes of LLC (one panel per benchmark, astar through xalancbmk; cache size from 128kB to 8MB).

Fig. 10: Cache allocation: a) first round, b) second round, and c) third round of bidding.

optimization goal for all applications.

Figure 11 shows the normalized throughput of 10 different mixes of applications [? ] using CARMA, KPart [? ], equal separate cache partitioning, and a completely shared cache space after convergence. Furthermore, Figure 12 shows the scalability of our proposed algorithm: when the number of co-runners increases from 2 to 16, performance improves without any need to track each application's performance in a central module. Although the centralized competitors have full information about the applications' profiles, CARMA outperforms them as the number of applications increases.

Since KPart is a centralized (not auction-based) approach, we assume that it has an unlimited budget; in CARMA, the budget matters. We set up another experiment to track the variation of normalized throughput versus normalized budget for a mix of 16 applications. Figure 13 shows that CARMA's throughput is very sensitive to the budget: it changes dramatically at an inflection point and eventually saturates at a level higher than KPart's.

5 RELATED WORK

With rapid improvements in computer technology, more and more cores are embedded in a single chip, and applications competing for shared resources are becoming common. On the one hand, scheduling shared resources for a large number of applications is challenging


Fig. 11: Throughput (normalized to shared LLC) of the shared, solo, KPart, and CARMA cache allocation schemes for Mix-01 through Mix-10.

Fig. 12: Performance improvement of CARMA and KPart (with confidence intervals) for different numbers of applications with respect to shared LLC for the case study of the cache congestion game.

Fig. 13: Relative throughput, Th(CARMA)/Th(KPart), w.r.t. applications' normalized budget per IPC.

in the sense that the operating system does not know the performance metric of each application. On the other hand, the operating system has a global view of the whole state of the system and can guide applications in choosing the shared resources.

There have been several works on managing the shared cache in multi-core systems. Qureshi et al. [13] showed that assigning more cache space to applications with higher cache utility does not always lead to better performance, since applications with very low cache reuse may still have very high cache utilization.

Several software and hardware approaches have been proposed to find the optimal partitioning of cache space among applications [3]. However, most of these approaches use a brute-force search over all possible combinations to find the best cache partitioning at runtime, or introduce substantial overhead. Some approaches use binary search to avoid searching all possible combinations [5, 14, 45], but none of these methods is scalable to future many-core processor designs.

There exist prior game-theoretic approaches that design a centralized scheduling framework aiming at a fair optimization of applications' utility [30-34]. Zahedi et al. in REF [30, 33] use the Cobb-Douglas production function as a fair allocator for cache and memory bandwidth. They show that the Cobb-Douglas function provides game-theoretic properties such as sharing incentives, envy-freedom, and Pareto efficiency. However, their approach is still centralized and spatially divides the shared resources to enforce a near-optimal fair policy, sacrificing performance. In their approach, the centralized scheduler assumes all applications have the same priority for cache and memory bandwidth, while we make no such assumption. Furthermore, our auction-based resource allocation can be used for any number of resources and any per-application priority, and the centralized scheduler does not need global knowledge of these priorities.

Ghodsi et al. in DRF [32] use another centralized fair policy to maximize the dominant resource utilization, but in practice it is not possible to clone an arbitrary number of instances of each resource. Cooper [31] enhances REF to capture colocated applications fairly, but it only addresses the special case of two sets of applications with matched resources. Fan et al. [34] exploit the computational sprinting architecture to improve task throughput, assuming a class of applications whose performance can be boosted by increasing power.

While all prior works use centralized scheduling that provides fairness and assumes the same utility function for all, co-runners might have completely diverse needs, and it is not efficient to apply the same fairness/performance policy across them. Our auction-based resource scheduling provides scalability, since individual applications compete for the shared resources based on their own utilities, and the burden of decision making is removed from the central scheduler. We believe that future CMPs should move toward a more decentralized approach, which is more scalable and provides a fair allocation of resources based on the applications' needs.

Auction theory, a subfield of economics, has recently been used as a tool to solve large-scale resource assignment in cloud computing [46, 47]. In an auction process, the buyers submit bids to get the commodities, and the sellers want to sell their commodities at the maximum possible price. Existing auction-based allocators [? ? ] are multi-buyer and multi-seller, but with only one resource to bid on; they cannot be used for our purpose, since we have only one seller with multiple bundled resources. That is why we choose a simpler related scheme for a computer architecture, to get higher performance with fewer transactions and auctions.

Our auction-based algorithm is inspired by the work of Bertsekas [35], which uses an auction-based approach for network flow problems. Our algorithm extends the local assignment problem proposed by Bertsekas et al., which has been shown to converge to the global assignment within a linear approximation.


6 CONCLUSION

This paper proposes a distributed resource allocation approach for large-scale servers. Traditional resource management systems are not scalable, especially when tracking applications' dynamic behavior; the main cause of this complexity is centralized decision making, which leads to higher time and space complexity. With the increasing number of cores per chip, the scalability of assigning different resources to different applications becomes more challenging in future-generation CMP systems. In addition, the diversity in applications' needs makes a single objective function inefficient for obtaining an optimal and fair performance metric. We introduce a framework that maps the allocation problem to the well-known auction economy model, in which the applications compete for the shared resources based on their utility metrics of interest.

REFERENCES

[1] D. Z. Tootaghaj and F. Farhat. Cage: A contention-aware game-theoretic model for heterogeneous resource assignment. In The 35th IEEE International Conference on Computer Design (ICCD), 2017.

[2] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, 2011.

[3] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In ACM SIGARCH Computer Architecture News. ACM, 2010.

[4] L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource. In PACT. ACM, 2006.

[5] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 2004.

[6] S. Cho and L. Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO, 2006.

[7] F. Farhat, D. Z. Tootaghaj, Y. He, A. Sivasubramaniam, M. T. Kandemir, and C. R. Das. Stochastic modeling and optimization of stragglers. IEEE Transactions on Cloud Computing, 2016.

[8] D. Z. Tootaghaj and F. Farhat. Optimal placement of cores, caches and memory controllers in network on-chip. arXiv preprint arXiv:1607.04298, 2016.

[9] D. Z. Tootaghaj, F. Farhat, M. Arjomand, P. Faraboschi, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das. Evaluating the combined impact of node architecture and cloud workload characteristics on network traffic and performance/cost. In IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2015.

[10] F. Farhat, D. Z. Tootaghaj, and M. Arjomand. Towards stochastically optimizing data computing flows. arXiv preprint arXiv:1607.04334, 2016.

[11] D. Z. Tootaghaj. Evaluating Cloud Workload Characteristics. PhD thesis, Pennsylvania State University, 2015.

[12] C. Liu, A. Sivasubramaniam, and M. T. Kandemir. Organizing the last line of defense before hitting the memory wall for CMPs. In IEEE Proceedings on Software, 2004.

[13] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO. IEEE Computer Society, 2006.

[14] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In IEEE 14th International Symposium on High Performance Computer Architecture (HPCA), pages 367–378, 2008.

[15] R. Iyer. CQoS: a framework for enabling QoS in shared caches of CMP platforms. In ICS. ACM, 2004.

[16] N. Rafique, W. T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT. ACM, 2006.

[17] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In PACT. ACM, 2008.

[18] A. Yekkehkhany, A. Hojjati, and M. H. Hajiesmaili. GB-PANDAS: Throughput and heavy-traffic optimality analysis for affinity scheduling. In IFIP Proceedings of Performance (IFIP Performance 2017), 2017.

[19] Q. Xie, A. Yekkehkhany, and Y. Lu. Scheduling with multi-level data locality: Throughput and heavy-traffic optimality. In INFOCOM. IEEE, 2016.

[20] Y. Zhou and D. Wentzlaff. The sharing architecture: sub-core configurability for IaaS clouds. In ACM SIGARCH Computer Architecture News. ACM, 2014.

[21] D. Z. Tootaghaj, F. Farhat, M. R. Pakravan, and M. R. Aref. Game-theoretic approach to mitigate packet dropping in wireless ad-hoc networks. In IEEE CCNC, 2011.

[22] D. Z. Tootaghaj, F. Farhat, M. R. Pakravan, and M. R. Aref. Risk of attack coefficient effect on availability of ad-hoc networks. In IEEE CCNC, 2011.

[23] K. Kotobi and S. G. Bilen. Spectrum sharing via hybrid cognitive players evaluated by an M/D/1 queuing model. EURASIP Journal on Wireless Communications and Networking, 2017.

[24] K. Kotobi and S. G. Bilen. Introduction of vigilante players in cognitive networks with moving greedy players. In Vehicular Technology Conference (VTC Fall). IEEE, 2015.

[25] G. Kesidis, K. Kotobi, and C. Griffin. Distributed aloha game with partially rule-based cooperative, greedy, and vigilante players. Department of Computer Science and Engineering, Penn State University, Tech. Rep. CSE, 2013.

[26] A. Kurve, K. Kotobi, and G. Kesidis. An agent-based framework for performance modeling of an optimistic parallel discrete event simulator. Complex Adaptive Systems Modeling, 2013.

[27] N. Nasiriani, C. Wang, G. Kesidis, and B. Urgaonkar. Using burstable instances in the public cloud: Why, when and how? Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2017.

[28] C. Wang, N. Nasiriani, G. Kesidis, B. Urgaonkar, Q. Wang, L. Y. Chen, A. Gupta, and R. Birke. Recouping energy costs from cloud tenants: Tenant demand response aware pricing design. In Proceedings of the 2015 ACM Sixth International Conference on Future Energy Systems. ACM, 2015.

[29] M. J. Osborne and A. Rubinstein. A course in game theory. MIT Press, 1994.

[30] S. M. Zahedi and B. C. Lee. REF: Resource elasticity fairness with sharing incentives for multiprocessors. ACM SIGARCH Computer Architecture News, 2014.

[31] Q. Llull, S. Fan, S. M. Zahedi, and B. C. Lee. Cooper: Task colocation with cooperative games. In IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017.

[32] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, 2011.

[33] S. M. Zahedi and B. C. Lee. Sharing incentives and fair division for multiprocessors. IEEE Micro, 2015.

[34] S. Fan, S. M. Zahedi, and B. C. Lee. The computational sprinting game. In ACM SIGOPS Operating Systems Review. ACM, 2016.

[35] D. P. Bertsekas. Network Optimization: Continuous and Discrete Methods. Athena Scientific, 1998.

[36] A. S. Kyle. Continuous auctions and insider trading. Econometrica: Journal of the Econometric Society, 1985.

[37] C. N. Vasconcelos and B. Rosenhahn. Bipartite graph matching computation on GPU. In EMMCVPR. Springer, 2009.

[38] J. Ferber. Multi-agent systems: An introduction to distributed artificial intelligence, volume 1. Addison-Wesley Reading, 1999.

[39] http://manetovo.org.

[40] http://www.spec.org/spec2006.

[41] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011.

[42] http://gem5.org/.

[43] G. E. Suh, S. Devadas, and L. Rudolph. A new memory monitoring scheme for memory-aware scheduling and partitioning. In Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2002.

[44] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 2004.

[45] D. K. Tam, R. Azimi, L. B. Soares, and M. Stumm. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations. In ACM SIGARCH Computer Architecture News, 2009.

[46] V. Krishna. Auction theory. Academic Press, 2009.

[47] S. Parsons, J. A. Rodriguez-Aguilar, and M. Klein. Auctions and bidding: A guide for computer scientists. ACM Computing Surveys (CSUR), 2011.

Farshid Farhat is a PhD candidate at the School of Electrical Engineering and Computer Science, The Pennsylvania State University. He obtained his B.Sc., M.Sc., and Ph.D. degrees in Electrical Engineering from Sharif University of Technology, Tehran, Iran. His current research interests include resource allocation in parallel and distributed systems, computer vision, and image processing. He is working on the modeling and analysis of image composition and aesthetics using deep learning and image retrieval on different platforms ranging from smartphones to high-performance computing clusters.

Diman Zad Tootaghaj is a Ph.D. student in the Department of Computer Science and Engineering at the Pennsylvania State University. She received B.S. and M.S. degrees in Electrical Engineering from Sharif University of Technology, Iran, in 2008 and 2011, and an M.S. degree in Computer Science and Engineering from the Pennsylvania State University in 2015. She is a member of the Institute for Networking and Security Research (INSR) under the supervision of Prof. Thomas La Porta (advisor), Dr. Ting He (co-advisor), and Dr. Novella Bartolini. Her current research interests include computer networks, recovery approaches, distributed systems, and stochastic analysis.