
A Case for NUMA-aware Contention Management on Multicore Systems

Sergey Blagodurov, Simon Fraser University

Sergey Zhuravlev, Simon Fraser University

Mohammad Dashti, Simon Fraser University

Alexandra Fedorova, Simon Fraser University

Abstract

On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, to mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers.

We discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler. In this paper we investigate the causes of this behavior and design the first contention-aware algorithm for NUMA systems.

1 Introduction

Contention for shared resources on multicore processors is a well-known problem. Consider a typical multicore system, schematically depicted in Figure 1, where cores share parts of the memory hierarchy, which we term memory domains, and compete for resources such as last-level caches (LLC), system request queues and memory controllers. Several studies investigated ways of reducing resource contention, and one of the promising approaches that emerged recently is contention-aware scheduling [23, 10, 16]. A contention-aware scheduler identifies threads that compete for shared resources of a memory domain and places them into different domains. In doing so the scheduler can improve the worst-case performance of individual applications or threads by as much as 80% and the overall workload performance by as much as 12% [23].

Unfortunately, studies of contention-aware algorithms focused primarily on UMA (Uniform Memory Access) systems, where there are multiple shared LLCs but only a single memory node equipped with a single memory controller, and memory can be accessed with the same latency from any core. However, new multicore systems increasingly use the Non-Uniform Memory Access (NUMA) architecture, due to its decentralized and scalable nature. In modern NUMA systems there are multiple memory nodes, one per memory domain (see Figure 1). Local nodes can be accessed in less time than remote ones, and each node has its own memory controller. When we ran the best known contention-aware schedulers on a NUMA system, we discovered that not only do they not manage contention effectively, but they sometimes even hurt performance when compared to a default contention-unaware scheduler (on our experimental setup we observed as much as 30% performance degradation caused by a NUMA-agnostic contention-aware algorithm relative to the default Linux scheduler). The focus of our study is to investigate (1) why contention-management schedulers that targeted UMA systems fail to work on NUMA systems and (2) how to devise an algorithm that works effectively on NUMA systems.

Why existing contention-aware algorithms may hurt performance on NUMA systems: Existing state-of-the-art contention-aware algorithms work as follows on NUMA systems. They identify threads that are sharing a memory domain and hurting each other's performance and migrate one of the threads to a different domain. This may lead to a situation where a thread's memory is located in a different domain than the one in which the thread is running. (E.g., consider a thread being migrated from core C1 to core C5 in Figure 1, with its memory being located in Memory Node #1.) We refer to migrations that may place a thread into a domain remote from its memory as NUMA-agnostic migrations.


Figure 1: A schematic view of a system with four memory domains and four cores per domain. There are 16 cores in total, and a shared L3 cache per domain.

NUMA-agnostic migrations create several problems, an obvious one being that the thread now incurs a higher latency when accessing its memory. However, contrary to a commonly held belief that remote access latency – i.e., the higher latency incurred when accessing a remote domain relative to accessing a local one – would be the key concern in this scenario, we discovered that NUMA-agnostic migrations create other problems, which are far more serious than remote access latency. In particular, NUMA-agnostic migrations fail to eliminate contention for some of the key hardware resources on multicore systems and create contention for additional resources. That is why existing contention-aware algorithms that perform NUMA-agnostic migrations not only fail to be effective, but can substantially hurt performance on modern multicore systems.

Challenges in designing contention-aware algorithms for NUMA systems: To address this problem, a contention-aware algorithm on a NUMA system must migrate the memory of the thread to the same domain where it migrates the thread itself. However, the need to move memory along with the thread makes thread migrations costly. So the algorithm must minimize thread migrations, performing them only when they are likely to significantly increase performance, and when migrating memory it must carefully decide which pages are most profitable to migrate. Our work addresses these challenges.

The contributions of our work can be summarized as follows:

• We discover that contention-aware algorithms known to work well on UMA systems may actually hurt performance on NUMA systems.

• We identify NUMA-agnostic migration as the cause of this phenomenon and identify the reasons why performance degrades. We also show that remote access latency is not the key reason why NUMA-agnostic migrations hurt performance.

• We design and implement Distributed Intensity NUMA Online (DINO), a new contention-aware algorithm for NUMA systems. DINO prevents superfluous thread migrations, but when it does perform migrations, it moves the memory of the threads along with the threads themselves. DINO performs up to 20% better than the default Linux scheduler and up to 50% better than Distributed Intensity, which is the best contention-aware scheduler known to us [23].

• We devise a page migration strategy that works online, uses Instruction-Based Sampling, and eliminates on average 75% of remote accesses.

Our algorithms were implemented at user level, since modern operating systems typically export the interfaces needed to implement the desired functionality. If needed, the algorithms can also be moved into the kernel itself.

The rest of this paper is organized as follows. Section 2 demonstrates why existing contention-aware algorithms fail to work on NUMA systems. Section 3 presents and evaluates DINO. Section 4 analyzes memory migration strategies. Section 5 provides the experimental results. Section 6 discusses related work, and Section 7 summarizes our findings.

2 Why existing algorithms do not work on NUMA systems

As we explained in the introduction, existing contention-aware algorithms perform NUMA-agnostic migration, and so a thread may end up running on a node remote from its memory. This creates additional problems besides introducing remote latency overhead. In particular, NUMA-agnostic migrations fail to eliminate memory controller contention, and they create additional interconnect contention. The focus of this section is to experimentally demonstrate why this is the case.

To this end, in Section 2.1 we quantify how contention for various shared resources contributes to the performance degradation that an application may experience as it shares the hardware with other applications. We show that memory controller contention and interconnect contention are the most important causes of performance degradation when an application is running remotely from its memory. Then, in Section 2.2, we use this finding to explain why NUMA-agnostic migrations can be detrimental to performance.


Figure 2: A schematic view of the system used in this study. A single domain is shown (four cores with private L2 caches, a shared L3 cache, the system request interface and crossbar switch, a memory controller, and HyperTransport links to DRAM and to other chips).

Figure 3: Placement of threads and memory in all experimental configurations. T denotes the CPU running the target application and MT the memory node holding T's memory; C denotes the CPUs running the competing applications and MC the memory node holding their memory. Scenarios 0-3 involve no cache contention and Scenarios 4-7 involve cache contention; each scenario is labeled with the degradation factors involved (CA, IC, MC, RL, or none).


2.1 Quantifying causes of contention

In this section we quantify the effects of performance degradation on multicore NUMA systems depending on how threads and their memory are placed in memory domains. For this part of the study, we use benchmarks from the SPEC CPU2006 benchmark suite. We perform experiments on a Dell PowerEdge server equipped with four AMD Barcelona processors running at 2.3GHz and 64GB of RAM, 16GB per domain. The operating system is Linux 2.6.29.6. Figure 2 schematically represents the architecture of each processor in this system.

We identify four sources of performance degradation that can occur on modern NUMA systems, such as those shown in Figures 1 and 2:

• Contention for the shared last-level cache (CA). This also includes contention for the system request queue and the crossbar.

• Contention for the memory controller (MC). This also includes contention for the DRAM prefetching unit.

• Contention for the inter-domain interconnect (IC).

• Remote access latency, occurring when a thread's memory is placed in a remote node (RL).

To quantify the effects of performance degradation caused by these factors we use the methodology depicted in Figure 3. We run a target application, denoted as T, with a set of three competing applications, denoted as C. The memory of the target application is denoted MT, and the memory of the competing applications is denoted MC. We vary (1) how the target application is placed with respect to its memory, (2) how it is placed with respect to the competing applications, and (3) how the memory of the target is placed with respect to the memory of the competing applications. Exploring performance in these various scenarios allows us to quantify the effects of NUMA-agnostic thread placement.

Figure 3 summarizes the relative placement of memory and applications that we used in our experiments. Next to each scenario we show the factors affecting the performance of the target application: CA, IC, MC or RL. For example, in Scenario 0 an application runs contention-free with its memory on a local node, so no performance-degrading factors are present. We term this the base case and compare the performance in other cases to it. The scenarios where there is cache contention are shown on the right and the scenarios where there is no cache contention are shown on the left.

We used two types of target and competing applications, classified according to their memory intensity: devil and turtle. The terminology is borrowed from an earlier study on application classification [21]. Devils are memory intensive: they generate a large number of memory requests. We classify an application as a devil if it generates more than two misses per 1000 instructions (MPI). Otherwise, an application is deemed a turtle. We further divide devils into two subcategories: regular devils and soft-devils. Regular devils have a miss rate that exceeds 15 misses per 1000 instructions. Soft-devils have an MPI between two and 15. Solo miss rates, obtained when an application runs on a machine alone, are used for classification.

We experimented with nine different target applications: three devils (mcf, omnetpp and milc), three soft-devils (gcc, bwaves and bzip) and three turtles (povray, calculix and h264).

Figure 4 shows how an application's performance degrades in Scenarios 1-7 from Figure 3 relative to Scenario 0. Performance degradation, shown on the y-axis, is measured as the increase in completion time relative to Scenario 0. The x-axis shows the type of competing applications that were running concurrently to generate contention: devil, soft-devil, or turtle.

These results demonstrate a very important point exhibited in Scenario 3: when a thread runs alone on a memory node (i.e., there is no contention for the cache), but its memory is remote and is in the same domain as the memory of another memory-intensive thread, performance degradation can be very severe, reaching 110% (see MILC, Scenario 3). One of the reasons is that the threads are still competing for the memory controller of the node that holds their memory. But this is exactly the scenario that can be created by a NUMA-agnostic migration, which migrates a thread to a different node without migrating its memory. This is the first piece of evidence showing why NUMA-agnostic migrations will cause problems.

We now present further evidence. Using the data in these experiments, we are able to estimate how much each of the four factors (CA, MC, IC, and RL) contributes to the overall performance degradation in Scenario 7 – the one where performance degradation is the worst. For that, we compare experiments that differ from each other by precisely one degradation factor. This allows us to single out the influence of this differentiating factor on the application's performance. Figure 5 shows the breakdown for the devil and soft-devil applications. Turtles are not shown, because their performance degradation is negligible. The overall degradation for each application relative to the base case is shown at the top of the corresponding bar. The y-axis shows the fraction of the total performance degradation that each factor causes. Since contention-causing factors on a real system overlap in complex and integrated ways, it is not possible to obtain a precise separation. These results are an approximation that is intended to direct attention to the true bottlenecks in the system.
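As a rough formalization of this estimation (our notation, not the paper's): if two scenarios S_with and S_without differ only in whether factor F is present, then F's share of the total degradation is estimated as

    share(F) ≈ (D(S_with) - D(S_without)) / D(Scenario 7)

where D(.) denotes the increase in completion time relative to the base case (Scenario 0).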

The results show that of all the performance-degrading factors, contention for the cache constitutes only a very small part, contributing at most 20% to the overall degradation. And yet, NUMA-agnostic migrations eliminate only contention for the shared cache (CA), leaving the more important factors (MC, IC, RL) unaddressed! Since the memory is not migrated with the thread, several memory-intensive threads could still have their memory placed in the same memory node, and so they would compete for the memory controller when accessing their memory. Furthermore, a migrated thread could be subject to the remote access latency, and because a thread would use the inter-node interconnect to access its memory, it would be subject to interconnect contention. In summary, NUMA-agnostic migrations fail to eliminate, or even exacerbate, the most crucial performance-degrading factors: MC, IC and RL.

2.2 Why existing contention management algorithms hurt performance

Now that we are familiar with the causes of performance degradation on NUMA systems, we are ready to explain why existing contention management algorithms fail to work on them. Consider the following example. Suppose that two competing threads A and B run on cores C1 and C2 of the system shown in Figure 1. A contention-aware scheduler would detect that A and B compete and migrate one of the threads, for example thread B, to a core in a different memory domain, for example core C5. Now A and B are not competing for the last-level (L3) cache, and on UMA systems this would be sufficient to eliminate contention. But on the NUMA system shown in Figure 1, A and B are still competing for the memory controller at Memory Node #1 (MC in Figure 5), assuming that their memory is physically located in Node #1. So by simply migrating thread B to another memory domain, the scheduler does not eliminate one of the most significant sources of contention – contention for the memory controller.

Furthermore, the migration of thread B to a different memory domain creates two additional problems, which degrade thread B's performance. Assuming that thread B's memory is physically located in Memory Node #1 (all operating systems of which we are aware would allocate B's memory on Node #1 if B is running on a core attached to Node #1, and would then leave the memory on Node #1 even after thread migration), B is now suffering from two additional sources of overhead: interconnect contention and remote latency (labeled IC and RL respectively in Figure 5). Although remote latency is not a crucially important factor, interconnect contention can hurt performance quite significantly.

To summarize, NUMA-agnostic migrations in the existing contention management algorithms cause the following problems, listed in order of severity according to their effect on performance: (1) they fail to eliminate memory-controller contention; (2) they may create additional interconnect contention; (3) they introduce remote latency overhead.


Figure 4: Performance degradation due to contention, cases 1-7 from Figure 3, relative to running contention-free (case 0). (a) No cache contention (Scenarios 1-3); (b) cache contention (Scenarios 4-7). The y-axis shows degradation due to contention (%); the x-axis shows the type of competing application (devils, soft-devils, turtles) for each target: MCF, OMNETPP, MILC, GCC, BWAVES, BZIP, POVRAY, CALCULIX, H264.

Figure 5: Contribution of each factor (IC, MC, CACHE, RL) to the worst-case performance degradation. Overall degradation relative to the base case: MCF 138%, OMNETPP 182%, MILC 153%, GCC 113%, BWAVES 79%, BZIP 75%.

3 A Contention-Aware Scheduling Algorithm for NUMA Systems

We design a new contention-aware scheduling algorithm for NUMA systems. We borrow the contention-modeling heuristic from the Distributed Intensity (DI) algorithm, because it was shown to perform within 3% of optimal on non-NUMA systems¹ [23]. Other contention-aware algorithms use similar principles as DI [10, 16].

We begin by explaining how the original DI algorithm works (Section 3.1). For clarity we will refer to it from now on as DI-Plain. We proceed to show that simply extending DI-Plain to migrate memory – this version of the algorithm is called DI-Migrate – is not sufficient to achieve good performance on NUMA systems. We conclude with the description of our new DI-NUMA Online, or DINO, which, in addition to migrating thread memory along with the thread, eliminates superfluous migrations and, unlike the other algorithms, improves performance on NUMA systems.

¹ Although some experiments with DI reported in [23] were performed on a NUMA machine, the experimental environment was configured so as to eliminate any effects of NUMA.

3.1 DI-Plain

DI-Plain works by predicting which threads will interfere if co-scheduled on the same memory domain and placing those threads on separate domains. Prediction is performed online, based on performance characteristics of threads measured via hardware counters. To predict interference, DI uses the miss-rate heuristic – a measure of last-level cache misses per thousand instructions, which includes the misses resulting from hardware prefetch requests. As we and other researchers showed in earlier work, the miss-rate heuristic is a good approximation of contention: if two threads have a high LLC miss rate, they are likely to compete for shared CPU resources and degrade each other's performance [23, 2, 10, 16].

Even though the miss rate does not capture the full complexity of thread interactions on modern multicore systems, it is an excellent predictor of contention for memory controllers and interconnects – the key resource bottlenecks on these systems – because it reflects how intensely threads use these resources. A detailed study showing why the miss-rate heuristic works well and how it compares to other modeling heuristics is reported in [23, 2].

DI-Plain continuously monitors the miss rates of running threads. Once in a while (every second in the original implementation), it sorts the threads according to their miss rates and assigns them to memory domains so as to co-schedule low-miss-rate threads with high-miss-rate threads. It does so by first iterating over the sorted threads starting from the most memory-intensive one (the one with the highest miss rate) and placing each thread in a separate domain, iterating over domains consecutively. This way it separates memory-intensive threads. Then it iterates over the array from the other end, starting from the least memory-intensive thread, placing each on an unused core in consecutive domains. Then it iterates from the other end of the array again, and continues alternating iterations until all threads have been placed. This strategy results in balancing the memory intensity across domains. DI-Plain performs no memory migration when it migrates the threads.

Existing operating systems (Linux, Solaris) would not move the thread's memory to another node when a thread is moved to a new domain. Linux performs new memory allocations in the new domain, but will leave the memory allocated before migration in the old one. Solaris acts similarly². So on either of these systems, if the thread after migration keeps accessing the memory that was allocated on another domain, it will suffer the negative performance effects described in Section 2.

3.2 DI-Migrate

Our first (and obvious) attempt to make DI-Plain NUMA-aware was to make it migrate the thread's memory along with the thread. We refer to this "intermediate" algorithm in our design exploration as DI-Migrate. The description of the memory migration algorithm is deferred until Section 4, but the general idea is that it detects which pages are actively accessed and migrates them to the new node along with a chunk of surrounding pages. For now we present a few experiments comparing DI-Plain with DI-Migrate. Our experiments will reveal that memory migration is insufficient to make DI-Plain work well on NUMA systems, and this will motivate the design of DINO.

Our experiments were performed on the same system as described in Section 2.1.

The benchmarks shown in this section are scientific applications from the SPEC CPU2006 and SPEC MPI2007 suites, with reference sets in both cases. (In a later section we also show results for the multithreaded Apache/MySQL workload.) We evaluated scientific applications for two reasons. First, they are CPU-intensive and often suffer from contention. Second, they were of interest to our partner Western Canadian Research Grid (WestGrid) – a network of compute clusters used by scientists at Canadian universities, and in particular by physicists involved in ATLAS, an international particle physics experiment at the Large Hadron Collider at CERN. The WestGrid site at our university is interested in deploying contention management algorithms on their clusters. The prospect of adoption of contention management algorithms in a real setting also motivated their user-level implementation – not requiring a custom kernel makes adoption less risky. Our algorithms are implemented on Linux as user-level daemons that measure threads' miss rates using perfmon, migrate threads using scheduling affinity system calls, and move memory using the numa_migrate_pages system call.

² Solaris will perform new allocations in the new domain if a thread's home lgroup – a representation of a thread's home memory domain – is reassigned upon migration, but will not move the memory allocated prior to home lgroup reassignment. If the lgroup is unchanged, even new memory allocations will be performed in the old domain.
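To make these mechanisms concrete, below is a minimal sketch (our illustration, not the authors' daemon code) of the two interfaces just mentioned: re-pinning a thread with sched_setaffinity and moving its resident pages between nodes with libnuma's numa_migrate_pages wrapper. It assumes a Linux machine with libnuma and at least two NUMA nodes; build with gcc mig.c -lnuma.

/* mig.c: user-level thread + memory migration sketch. */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin thread `tid` to `core`, then migrate its pages from NUMA node
 * `from` to node `to` (tid 0 means the calling thread). */
static int move_thread(pid_t tid, int core, int from, int to)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    struct bitmask *src = numa_allocate_nodemask();
    struct bitmask *dst = numa_allocate_nodemask();
    numa_bitmask_setbit(src, from);
    numa_bitmask_setbit(dst, to);
    /* Returns the number of pages that could not be moved, or -1. */
    int rc = numa_migrate_pages(tid, src, dst);
    numa_free_nodemask(src);
    numa_free_nodemask(dst);
    return rc < 0 ? -1 : 0;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with two or more nodes\n");
        return 1;
    }
    /* Illustrative only: pin self to core 0 and pull memory from
     * node 1 to node 0 (assuming core 0 belongs to node 0). */
    return move_thread(0, 0, 1, 0) == 0 ? 0 : 1;
}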

For SPEC CPU we show one workload for brevity; complete results are presented in Section 5. All benchmarks in the workload are launched simultaneously, and if one benchmark terminates it is restarted until every benchmark has completed three times. We use the result of the second execution for each benchmark, and perform the experiment ten times, reporting the average of these runs.

For SPEC MPI we show results for eleven different MPI jobs. In each experiment we run a single job comprised of 16 processes. We perform ten runs of each job and present the average completion times.

We compare performance under DI-Plain and DI-Migrate relative to the default Linux Completely Fair Scheduler, to which we refer as Default. Standard deviation across the runs is under 6% for the DI algorithms. Deviation under Default is necessarily high, because, being unaware of resource contention, it may produce a low-contention thread placement in one run and a high-contention mapping in another. A detailed comparison of deviations under different schedulers is also presented in Section 5.

Figures 6 and 7 show the average completion time improvement for the SPEC CPU and SPEC MPI workloads respectively (higher numbers are better) under the DI algorithms relative to Default. We draw two important conclusions. First of all, DI-Plain often hurts performance on NUMA systems, sometimes by as much as 36%. Second, while DI-Migrate eliminates the performance loss and even improves performance for the SPEC CPU workloads, it fails to excel with the SPEC MPI workloads, hurting performance by as much as 25% for GAPgeofem.

Our investigation revealed that DI-Migrate migrated processes far more frequently in the SPEC MPI workload than in the SPEC CPU workload. Fewer than 50 migrations per process per hour were performed for the SPEC CPU workloads, but as many as 400 (per process) were performed for SPEC MPI! DI-Migrate will migrate a thread to a different core any time its miss rate (and its position in the array sorted by miss rates) changes. For the dynamic SPEC MPI workload this happened rather frequently and led to frequent migrations.


Figure 6: Improvement of completion time under DI-Plain and DI-Migrate relative to Default for a SPEC CPU 2006 workload (y-axis: % improvement).

Figure 7: Improvement of completion time under DI-Plain and DI-Migrate relative to Default for eleven SPEC MPI 2007 jobs: milc, leslie, GemsFDTD, fds, pop, tachyon, lammps, GAPgeofem, socorro, zeusmp, lu.

Unlike on UMA systems, thread migrations are not cheap on NUMA systems, because the memory of the thread must be moved as well. No matter how efficient memory migrations are, they will never be completely free, so it is always worth reducing the number of migrations to the minimum, performing them only when they are likely to result in improved performance. Our analysis of DI-Migrate's behaviour for the SPEC MPI workload revealed that migrations often resulted in a thread placement that was no better in terms of contention than the placement prior to migration. This invited opportunities for improvement, which we used in the design of DINO.

3.3 DINO

3.3.1 Motivation

DINO's key novelty is in eliminating superfluous thread migrations – those that are not likely to reduce contention. Recall that DI-Plain (Section 3.1) triggers migrations when threads change their miss rates and their relative positions in the sorted array. Miss rates may change rather often, but we found that it is not necessary to respond to every change in order to reduce contention.

This insight comes from the observation that while the miss rate is an excellent heuristic for predicting relative contention at coarse granularity (and that is why it was shown to perform within 3% of the optimal oracular scheduler in DI), it does not perfectly predict how contention is affected by small changes in the miss rate.

Figure 8 illustrates this point. It shows on the x-axis SPEC CPU 2006 applications sorted in decreasing order by their performance degradation when co-scheduled on the same domain with three instances of itself, relative to running solo. The bars show the miss rates and the line shows the degradations³. In general, with the exception of one outlier, mcf, if one application has a much higher miss rate than another, it will have a much higher degradation. But if the difference in the miss rates is small, it is difficult to predict the relative difference in degradations.

³ We omit several benchmarks whose counters failed to record during the experiment.

What this means is that it is not necessary for the scheduler to migrate threads upon small changes in the miss rate, only upon large ones.

Figure 8: Performance degradation due to contention (line, %) and miss rates (bars, LLC misses per 1000 instructions) for SPEC CPU2006 applications.

3.3.2 Thread classification in DINO and multithreaded support

To build upon this insight, we design DINO to organize threads into broad classes according to their miss rates, and to perform migrations only when threads change their class, while trying to preserve thread-core affinities whenever possible. Classes are defined as follows (again, we borrow the animalistic classification from previous work):

Class 1: turtles – fewer than two LLC misses per 1000 instructions.
Class 2: devils – 2-100 LLC misses per 1000 instructions.
Class 3: super-devils – more than 100 LLC misses per 1000 instructions.
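In code, the classification step is a simple threshold check; the following C sketch (ours, using the thresholds above) illustrates it:

/* Classify a thread by its LLC misses per 1000 instructions,
 * using the thresholds chosen for our target architecture. */
enum tclass { TURTLE, DEVIL, SUPER_DEVIL };

static enum tclass classify(double misses_per_1k)
{
    if (misses_per_1k < 2.0)
        return TURTLE;       /* Class 1 */
    if (misses_per_1k <= 100.0)
        return DEVIL;        /* Class 2 */
    return SUPER_DEVIL;      /* Class 3 */
}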

Threshold values for the classes were chosen for our target architecture. Values for other architectures should be chosen by examining the relationship between the miss rates and degradations on that architecture.

Before we describe DINO in detail, we explain the special new features in DINO that deal with multithreaded applications.

First of all, DINO tries to co-schedule threads of the same application on the same memory domain, provided that this does not conflict with DINO's contention-aware assignment (described below). This assumes that the performance improvement from co-operative data sharing when threads are co-scheduled on the same domain is much smaller than the negative effects of contention. This is true for many applications [22]. However, when this assumption does not hold, DINO can be extended to predict when co-scheduling threads on the same domain is more beneficial than separating them, using techniques described in [9] or [19].

When it is not possible to co-schedule all threads of an application on the same domain, and if the threads actively share data, they will put pressure on the memory controller and interconnects. While there is not much the scheduler can do in this situation (re-designing the application is the best alternative), it must at least avoid migrating the memory back and forth, so as not to make performance worse. Therefore, DINO detects when memory is being "ping-ponged" between nodes and discontinues memory migration in that case.
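The paper does not spell out the ping-pong detector, so the policy and threshold below are purely our assumption of one plausible minimal design: suppress further migration once a process's memory has bounced straight back between the same nodes several times in a row.

/* Hypothetical ping-pong detector (our assumption, not DINO's code). */
#define PINGPONG_LIMIT 3            /* hypothetical threshold */

struct mig_history {
    int last_node;                  /* node memory was last moved to */
    int prev_node;                  /* node it was on before that    */
    int bounces;                    /* consecutive A->B->A moves     */
};

/* Returns 1 if a proposed migration to `target` should be suppressed. */
static int pingpong(struct mig_history *h, int target)
{
    if (target == h->prev_node)     /* moving straight back again */
        h->bounces++;
    else
        h->bounces = 0;
    h->prev_node = h->last_node;
    h->last_node = target;
    return h->bounces >= PINGPONG_LIMIT;
}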

3.3.3 DINO algorithm description

We now explain how DINO works using an example. In every rebalancing interval, set to one second in our implementation, DINO reads the miss rate of each thread from hardware counters. It then determines each thread's class based on its miss rate. To reduce the influence of sudden spikes, a thread only changes its class if it spent at least 7 out of the last 10 intervals with the miss rate of the new class. Otherwise, the thread's class remains the same. We save this data as an array of tuples <new_class, new_processID, new_threadID>, sorted by the memory intensity of the class (super-devils, followed by devils, followed by turtles). Suppose we have a workload of eight threads containing two super-devils (D), three devils (d) and three turtles (t). Threads numbered <0, 3, 4, 5> are part of process 0. The remaining threads, numbered 1, 2, 6 and 7, each belong to a separate process, numbered 1, 2, 3 and 4 respectively⁴. Then the sorted tuple array will look like this:

new_class:     D D d d d t t t
new_processID: 0 4 0 2 3 0 0 1
new_threadID:  0 7 4 2 6 3 5 1

⁴ DINO assigns a unique thread ID to each thread in the workload.

DINO then proceeds with the computation of the placement layout for the next interval. The placement layout defines how threads are placed on cores. It is computed by taking the most aggressive class instance (a super-devil in our example) and placing it on a core in the first memory domain dom0, then the second most aggressive (also a super-devil) on a core in the second domain, and so on until we reach the last domain. Then we iterate from the opposite end of the array (starting with the least memory-intensive instance) and spread those instances across the domains in the same order. We continue alternating between the two ends of the array until all class instances have been placed on cores. In our example, for a NUMA machine with four memory domains and two cores per domain, the layout will be computed as follows:

domain:        dom0    dom1    dom2    dom3
new_core:      0  1    2  3    4  5    6  7
layout:        D  t    D  t    d  t    d  d

Although this example assumes that the number of threads equals the number of cores, the algorithm generalizes to scenarios where the number of threads is smaller or greater than the number of cores. In the latter case, each core will have T "slots" that can be filled with threads, where T = num_threads / num_cores, and instead of taking one class instance from the array at a time, DINO will take T.
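The layout pass can be reconstructed as the following C sketch (our reconstruction, not the authors' code); for one thread slot per core it prints exactly the layout of the running example:

/* layout.c: deal class instances, sorted by memory intensity, to
 * domains in passes that alternate between the ends of the array. */
#include <stdio.h>

#define NDOM 4                       /* memory domains    */
#define CPD  2                       /* cores per domain  */
#define N    (NDOM * CPD)

int main(void)
{
    char sorted[N] = {'D','D','d','d','d','t','t','t'};
    char layout[N];
    int next[NDOM];                  /* next free core per domain */
    for (int d = 0; d < NDOM; d++)
        next[d] = d * CPD;

    int lo = 0, hi = N - 1, from_top = 1;
    while (lo <= hi) {
        for (int d = 0; d < NDOM && lo <= hi; d++)
            layout[next[d]++] = from_top ? sorted[lo++] : sorted[hi--];
        from_top = !from_top;        /* switch ends after each pass */
    }
    for (int c = 0; c < N; c++)
        printf("%c ", layout[c]);    /* prints: D t D t d t d d */
    printf("\n");
    return 0;
}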

Now that we have determined the layout of class instances, we must still decide which thread will fill each core-class slot – any thread of the given class can potentially fill a slot corresponding to that class. In making this decision, we would like to match threads to class instances so as to minimize the number of migrations. To achieve that, we refer to the matching solution from the old rebalancing interval, saved in the form of a tuple array <old_domain, old_core, old_class, old_processID, old_threadID> for each thread.

Migrations are deemed superfluous if they change the thread-core assignment while not changing the placement of class instances on cores. For example, if a thread that happens to be a devil (d) runs on a core that has been assigned a (d)-slot in the new assignment, it is not necessary to migrate this thread to another core with a (d)-slot. DI-Plain did not take this into consideration and thus performed a lot of superfluous migrations. To avoid them in DINO, we first decide the thread assignment for any tuple that preserves the core-class placement according to the new layout. So, if for a given thread old_core = new_core and old_class = new_class, then the corresponding tuple in the new solution for that thread will be <new_core, new_class, old_processID, old_threadID>.

For example, if the old solution were:

domain:        dom0    dom1    dom2    dom3
old_core:      0  1    2  3    4  5    6  7
old_class:     D  t    d  t    d  t    d  t
old_processID: 0  1    2  0    0  0    3  4
old_threadID:  0  1    2  3    4  5    6  7

then the initial shape of the new solution would be (with "-" marking slots not yet filled):

domain:        dom0    dom1    dom2    dom3
new_core:      0  1    2  3    4  5    6  7
new_class:     D  t    D  t    d  t    d  d
new_processID: 0  1    -  0    0  0    3  -
new_threadID:  0  1    -  3    4  5    6  -

Then, the threads whose placement was not determined in the previous step – i.e., those whose old class is not the same as their current core's new class, as determined by the new placement – fill the unused cores according to their new class:

domain:        dom0    dom1    dom2    dom3
new_core:      0  1    2  3    4  5    6  7
new_class:     D  t    D  t    d  t    d  d
new_processID: 0  1    4  0    0  0    3  2
new_threadID:  0  1    7  3    4  5    6  2
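For concreteness, the two passes can be written as the following C sketch (ours, not the authors' code); the arrays encode the running example, and the program prints the same core-to-thread assignment as the tuple array above:

/* match.c: DINO's two-pass slot filling for the running example. */
#include <stdio.h>

#define N 8

int main(void)
{
    char new_class[N]  = {'D','t','D','t','d','t','d','d'};
    /* Old solution: thread i ran on core i with class old_class[i]. */
    char old_class[N]  = {'D','t','d','t','d','t','d','t'};
    int  old_thread[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    /* Current class of each thread (thread 7 became a super-devil). */
    char thr_class[N]  = {'D','t','d','t','d','t','d','D'};

    int placed[N], used[N] = {0}, filled[N] = {0};

    /* Pass 1: keep every thread whose core kept its class. */
    for (int c = 0; c < N; c++)
        if (old_class[c] == new_class[c]) {
            placed[c] = old_thread[c];
            used[old_thread[c]] = filled[c] = 1;
        }

    /* Pass 2: remaining threads fill unused cores of their class. */
    for (int t = 0; t < N; t++) {
        if (used[t]) continue;
        for (int c = 0; c < N; c++)
            if (!filled[c] && new_class[c] == thr_class[t]) {
                placed[c] = t;
                filled[c] = used[t] = 1;
                break;
            }
    }
    for (int c = 0; c < N; c++)
        printf("core %d: thread %d (%c)\n", c, placed[c], new_class[c]);
    return 0;
}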

Now that the thread placement is determined, DINO makes a final pass over the thread tuples to take care of multithreaded applications. For each thread A, it checks whether there is another thread B of the same multithreaded application (new_processID(A) = new_processID(B)) among the thread tuples not yet iterated over, such that B is not placed in the same memory domain as A. If there is, we check the threads that are placed in the same memory domain as A. If there is a thread C in the same domain as A such that new_processID(A) != new_processID(C) and new_class(B) = new_class(C), then we swap tuples B and C in the new solution. In our example this results in the following assignment:

domain:        dom0    dom1    dom2    dom3
new_core:      0  1    2  3    4  5    6  7
new_class:     D  t    D  t    d  t    d  d
new_processID: 0  0    4  1    0  0    3  2
new_threadID:  0  3    7  1    4  5    6  2

DINO has complexity O(N) in the number of threads. Since the algorithm runs at most once a second, this incurs little overhead even for a large number of threads. We found that more frequent thread rebalancing did not yield better performance. The relatively infrequent changes of thread affinities mean that the algorithm is best suited for long-lived applications, such as the scientific applications we target in our study, data analytics (e.g., MapReduce), or servers. When there are more threads than cores, coarse-grained rebalancing is performed by DINO, but fine-grained time sharing of cores between threads is performed by the kernel scheduler. If threads are I/O- or synchronization-intensive and have unequal sleep-awake periods, any resulting load imbalance must be corrected, e.g., as in [16].

3.3.4 DINO’s Effect on Migration Frequency

We conclude this section by demonstrating how DINO reduces migration frequency relative to DI-Migrate. Table 1 shows the average number of memory migrations per hour of execution under DI-Migrate and DINO for different applications from the workloads evaluated in Section 3.2. The results for MPI jobs are given for one of their processes, not for the whole job. Due to space limitations, we show numbers for selected applications that are representative of the overall trend. The numbers show that DINO significantly reduces the number of migrations. As will be shown in Section 5, this results in up to 30% performance improvements for jobs in the MPI workload.

4 Memory migration

The straightforward way to implement memory migration is to migrate the entire resident set of the thread when the thread is moved to another domain. This does not work, for the following reasons. First of all, for multithreaded applications, even those where data sharing is rare, it is difficult to determine how the resident set is partitioned among the threads. Second, even if the application is single-threaded, if its resident set is large it will not fit into a single memory domain, so it is not possible to migrate it in its entirety. Finally, we found experimentally that even in cases where it is possible to migrate the entire resident set of a process, doing so can hurt the performance of applications with large memory footprints. So in this section we describe how we designed and implemented a memory migration strategy that determines which of the thread's pages are most profitable to migrate when the thread is moved to a new core.

4.1 Designing the migration strategy

In order to rapidly evaluate various memory migration strategies, we designed a simulator based on Pin [15], a widely used binary instrumentation tool for x86 binaries. Using Pin, we collected memory access traces of all SPEC CPU2006 benchmarks and then used a cache simulator on top of Pin to determine which of those accesses would be LLC misses and thus require an access to memory.

Table 1: Average number of memory migrations per hour of execution under DI-Migrate and DINO for applications evaluated in Section 3.2.

             SPEC CPU2006                        SPEC MPI2007
             soplex  milc  lbm  gamess  namd     leslie  lammps  GAPgeofem  socorro   lu
DI-Migrate       36    22   11      47    41        381     135        237      340  256
DINO              8     6    5       7     6          2       1          3        2    1

To evaluate memory migration strategies we used a metric called Saved Remote Accesses (SRA). SRA is the percentage of remote memory accesses that were eliminated using a particular memory migration strategy (after the thread was migrated) relative to not migrating the memory at all. For example, if we detect every remote access and migrate the corresponding page to the thread's new memory node, we eliminate all remote accesses, so the SRA would be 100%.
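In symbols (our notation): if R_none is the number of remote accesses incurred when the memory is never migrated, and R_strategy is the number remaining under a given migration strategy, then

    SRA = 100% * (R_none - R_strategy) / R_none

so detecting and migrating on every remote access yields SRA = 100%, and migrating nothing yields SRA = 0%.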

Each strategy that we evaluated detects when a thread is about to perform an access to a remote domain, and migrates one or more memory pages from the thread's virtual address space associated with the requested address. We tried the following strategies: sequential-forward, where K pages including and following the one corresponding to the requested address are migrated; sequential-forward-backward, where K/2 pages sequentially preceding and K/2 pages sequentially following the requested address are migrated; random, where K randomly chosen pages are migrated; and pattern-based, where we detect a thread's memory-access pattern by monitoring its previous accesses, similarly to how hardware prefetchers do, and migrate K pages that match the pattern. We found that sequential-forward-backward was the most effective migration policy in terms of SRA.

Another challenge in designing a memory migration strategy is minimizing the overhead of detecting which of the remote memory addresses are actually being accessed. Ideally, we want to be able to detect every remote access and migrate the associated pages. However, on modern hardware this would require unmapping address translations on a remote domain and handling a page fault every time a remote access occurs. This results in frequent interrupts and is therefore expensive.

After analyzing our options we decided to use the hardware counter sampling available on modern x86 systems: PEBS (Precise Event-Based Sampling) on Intel processors and IBS (Instruction-Based Sampling) on AMD processors. These mechanisms tag a sample of instructions with various pieces of information; load and store instructions are annotated with the memory address.

While hardware-based event sampling has low overhead, it also provides relatively low sampling accuracy – on our system it samples less than one percent of instructions. So we also analysed how SRA is affected by the sampling accuracy as well as by the number of pages that are migrated. The lower the accuracy, the higher the value of K (pages to be migrated) needs to be to achieve a high SRA. For the hardware sampling accuracy that was acceptable in terms of CPU overhead (less than 1% per core), we found that migrating 4096 pages enables us to achieve an SRA as high as 74.9%. We also confirmed experimentally that this is a good value for K (results shown later).

4.2 Implementation of the memory migration algorithm

Our memory migration algorithm is implemented for AMD systems, so we use IBS, which we access via the Linux performance-monitoring tool perfmon [5].

Migration in DINO is performed by a user-level daemon running separately from the scheduling daemon. The daemon wakes up every ten milliseconds, sets up IBS to perform sampling, reads the next sample, and migrates the page containing the memory address in the sample (if the sampled instruction was a load or a store), along with the K pages in the application's address space that sequentially precede and follow the accessed page. Page migration is effected using the numa_move_pages system call.
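Below is a sketch (ours, not the authors' daemon code) of the daemon's inner step; the IBS/perfmon setup is omitted and the sampled address is a placeholder. It moves a window of K pages centred on the page containing the sampled address using numa_move_pages; build with -lnuma.

/* daemon.c: migrate a window of pages around a sampled address. */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

#define K    4096                     /* pages per migration (Sec. 4.1) */
#define PAGE 4096UL                   /* page size on our system        */

static void migrate_window(pid_t pid, uintptr_t addr, int target_node)
{
    static void *pages[K];
    static int nodes[K], status[K];
    /* K/2 pages before and K/2 pages after the accessed page. */
    uintptr_t first = (addr & ~(PAGE - 1)) - (K / 2) * PAGE;
    for (unsigned long i = 0; i < K; i++) {
        pages[i] = (void *)(first + i * PAGE);
        nodes[i] = target_node;
    }
    /* Per-page outcomes (including unmapped pages) land in status[]. */
    if (numa_move_pages(pid, K, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("numa_move_pages");
}

int main(void)
{
    for (;;) {                        /* daemon loop: wake every 10 ms  */
        usleep(10 * 1000);
        /* In the real daemon: read the next IBS sample here and call
         * migrate_window(tid, sampled_address, new_node).              */
        break;                        /* placeholder so the sketch ends */
    }
    return 0;
}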

5 Evaluation

5.1 Workloads

In this section we evaluate DINO implemented with the migration strategy described in the previous section. We evaluate three workload types: SPEC CPU2006 applications, SPEC MPI2007 applications, and LAMP – Linux/Apache/MySQL/PHP.

We used two experimental systems for evaluation. One was described in Section 2.1. The other is a Dell PowerEdge server equipped with two AMD Barcelona processors running at 2GHz and 8GB of RAM, 4GB per domain. The operating system is Linux 2.6.29.6. The experimental design for the SPEC CPU and MPI workloads was described in Section 3.2. The LAMP workload is described below.

The LAMP acronym describes the application environment consisting of Linux, Apache, MySQL and PHP. The main data processing in LAMP is done by the Apache HTTP server and the MySQL database engine. The server management daemons apache2 and mysqld are responsible for arranging access to the website scripts and database files and for performing the actual work of data storage and retrieval. We use Apache 2.2.14 with PHP version 5.2.12 and MySQL 5.0.84. Both apache2 and mysqld are multithreaded applications that spawn one distinct thread for each new client connection. This client thread within a daemon is then responsible for executing the client's request.

In our experiment, clients continuously retrieve from the Apache server various statistics about website activity. Our database is populated with data gathered by a web statistics system for five real commercial websites. This data includes information about each website's audience activity (what pages on what website were accessed, in what order, etc.) as well as information about the visitors themselves (client OS, user agent information, browser settings, session id retrieved from cookies, etc.). The total number of records in the database is more than 3 million. We have four Apache daemons, each responsible for handling a different type of request. There are also four MySQL daemons that perform maintenance of the website database.

We further demonstrate the effect that the choice of K (the number of pages that are moved on every migration) has on the performance of DINO. Then we compare DINO to DI-Plain, DI-Migrate and Default.

5.2 Effect of K

Two of our workloads, SPEC CPU and LAMP, demonstrate the key insights, so we focus on those. We show how performance changes as we vary the value of K. We compare to the scenario where DINO migrates the thread's entire resident set upon migrating the thread itself. The per-process resident sets of the two chosen workloads could actually fit in a single memory node on our system (it had 4GB per node), so whole-resident-set migration was possible. For SPEC CPU applications, resident sets vary from under a megabyte to 1.6GB for mcf. In general, they are in the hundreds of megabytes for memory-intensive applications and much smaller for others. In LAMP, MySQL's resident set was about 400MB and Apache's was 120MB.

We show average completion time improvement (for Apache/MySQL this is average completion time per request), worst-case execution time improvement, and deviation improvement. Completion time improvement is the average over ten runs. To compute the worst-case execution time we run each workload ten times and record the longest completion time. Improvement in deviation is the percent reduction in the standard deviation of the average completion time.

Figure 9 shows the results for the SPEC CPU workloads. Performance is hurt when we migrate a small number of pages, but becomes comparable to whole-resident-set migration when K reaches 4096. Whole-resident-set migration actually works quite well for this workload, because migrations are performed infrequently and the resident sets are small.

However, upon experimenting with the LAMP workload we found that whole-resident-set migration was detrimental to performance, most likely because the resident sets were much larger and also because this is a multithreaded workload where threads share data. Figure 10 shows performance and deviation improvement when K = 4096 relative to whole-resident-set migration. Performance is substantially improved when K = 4096. We experimented with smaller values of K, but found no substantial differences in performance.

We conclude that migrating very large chunks of memory is acceptable for processes with small resident sets, but not advisable for multithreaded applications and/or applications with large resident sets. DINO migrates threads infrequently, so a relatively large value of K results in good performance.

Figure 10: Performance improvement with DINO for K = 4096 relative to whole-resident-set migration for LAMP (average time, worst time, and time deviation improvement, %).

5.3 DINO vs. other algorithms

We compare performance under DINO, DI-Plain and DI-Migrate relative to Default and, similarly to the previous section, report completion time improvement, worst-case execution time improvement and deviation improvement.

Figures 11-13 show the results for the three workload types: SPEC CPU, SPEC MPI and LAMP, respectively. For SPEC CPU, DI-Plain hurts completion time for many applications, but both DI-Migrate and DINO improve it, with DINO performing slightly better than DI-Migrate for most applications. Worst-case improvement numbers show a similar trend, although DI-Plain does not perform as poorly there. Improvements in the worst-case execution time indicate that a scheduler is able to avoid pathological thread assignments that create especially high contention, and thus produce more stable performance. Deviation of running times is improved by all three schedulers relative to Default.


Figure 9: Performance improvement with DINO as K is varied, relative to whole-resident-set migration, for SPEC CPU. Panels: (a) 1 page (4KB); (b) 256 pages (1MB); (c) 4096 pages (16MB). Each panel reports average time, worst-case time, and time deviation improvement (y-axis: % improvement).

Figure 11: DINO, DI-Migrate and DI-Plain relative to Default for SPEC CPU 2006 workloads. Panels: (a) average execution time improvement of DINO (IBS), DI-Migrate (IBS) and DI-Plain over the Default Linux scheduler; (b) worst-case execution time improvement; (c) deviation improvement (y-axis: % improvement).

As to the SPEC MPI workloads (Figure 12), only DINO is able to improve completion times across the board, by as much as 30% for some jobs. DI-Plain and DI-Migrate, on the other hand, can hurt performance by as much as 20%. Worst-case execution time also consistently improves under DINO, while sometimes degrading under DI-Plain and DI-Migrate.

LAMP is a tough workload for DINO, or for any scheduler that optimizes memory placement, because the workload is multithreaded: no matter how the threads are placed, they still share data, putting pressure on interconnects. Nevertheless, DINO still manages to improve completion time and worst-case execution time in some cases, and to a larger extent than the other two algorithms.

5.4 Discussion

Our evaluation demonstrates that DINO is significantly better at managing contention on NUMA systems than the DI algorithm designed without NUMA awareness, or DI simply extended with memory migration.


Figure 12: DINO, DI-Migrate and DI-Plain relative to Default for SPEC MPI 2007. Panels: (a) completion time improvement; (b) worst-case time improvement; (c) deviation improvement. Benchmarks: milc, leslie, GemsFDTD, fds, pop, tachyon, lammps, GAPgeo, fem, socorro, zeusmp, lu (y-axis: % improvement).

Figure 13: DINO, DI-Migrate and DI-Plain relative to Default for LAMP. Panels: (a) completion time improvement; (b) worst-case time improvement; (c) deviation improvement (y-axis: % improvement).

Multiprocess workloads representative of scientific Grid clusters show excellent performance under DINO. Improvements for the challenging multithreaded workloads are, as expected, less significant, and wherever degradation occurs for some threads it is outweighed by performance improvements for other threads.

6 Related Work

Research on NUMA-related optimizations to systems is rich and dates back many years. Many research efforts addressed efficient co-location of computation and related memory on the same node [14, 3, 12, 19, 1, 4]. More ambitious proposals aimed to holistically redesign the operating system to dovetail with NUMA architectures [7, 17, 6, 20, 11]. None of the previous efforts, however, addressed shared resource contention in the context of NUMA systems and the associated challenges.

Li et al. in [14] introduced AMPS, an operating system scheduler for asymmetric multicore systems that supports NUMA architectures. AMPS implemented a NUMA-aware migration policy that can allow or deny a thread migration requested by the scheduler: the authors used a thread's resident set size to decide whether the scheduler may migrate the thread to a different domain, and if the migration overhead was expected to be high, the migration was disallowed. Our scheduler, instead of prohibiting migrations, detects which pages are being actively accessed and moves them, as well as surrounding pages, to the new domain.
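As a rough illustration of such a policy, not code from [14], the gate below reads a process's resident set size from /proc/<pid>/statm and denies migration when the expected move cost is too high; RSS_LIMIT_PAGES is an assumed tunable.

#include <stdio.h>

#define RSS_LIMIT_PAGES (64 * 1024)   /* assumed threshold: 256MB of 4KB pages */

/* Return nonzero if migrating `pid` to another domain looks affordable. */
static int migration_allowed(int pid)
{
    char path[64];
    unsigned long size, resident = 0;

    snprintf(path, sizeof(path), "/proc/%d/statm", pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return 1;   /* no data: default to allowing the migration */
    /* /proc/<pid>/statm: total program size, then resident set size, in pages */
    if (fscanf(f, "%lu %lu", &size, &resident) != 2)
        resident = 0;
    fclose(f);
    return resident <= RSS_LIMIT_PAGES;
}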

LaRowe et al. [12] presented a dynamic multiple-copy placement and migration policy for NUMA systems. The policy periodically reevaluates its memory placement decisions and allows multiple physical copies of a single virtual page. It supports both migration and replication, with the choice between the two operations based on reference history; a directory-based invalidation scheme is used to ensure the coherence of replicated pages. The policy applies a freeze/defrost strategy: the decision of when to defrost a frozen page and trigger reevaluation of its placement is based on both the time and the reference history of the page. The authors evaluate various fine-grained page migration and/or replication strategies; however, since their test machine had only one processor per NUMA node, they do not address contention. The strategies developed in this work could have been very useful for our contention-aware scheduler if the inexpensive mechanisms the authors used for detecting page accesses were available to us. Detailed page reference history is difficult to obtain without hardware support, and obtaining it in software may cause overhead for some workloads.

Goglin et al. [8] developed an effective implementation of the move_pages system call in Linux, which allows dynamic migration of large memory areas to be significantly faster than in previous versions of the OS. This work is integrated into Linux kernel 2.6.29 [8], which we use for our experiments.


The next-touch policy, also introduced in the paper to facilitate thread-data affinity, works as follows: the application marks pages that it will likely access in the future as migrate-on-next-touch using a new parameter to the madvise() system call. The Linux kernel then ensures that the next access to these pages causes a special page fault, resulting in the pages being migrated to their threads. This work gives developers an opportunity to improve memory proximity in their programs. Our work, on the other hand, improves memory proximity using hardware counter data available on every modern machine; no involvement from the developer is needed.
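For illustration, the stock move_pages(2) interface can also be used in query mode to find misplaced pages; the sketch below, our example rather than code from [8], passes NULL for the node array, which makes the kernel report the node where each page currently resides.

#include <numaif.h>   /* move_pages(2); link with -lnuma */
#include <stdio.h>

/* Query (not move) the placement of `count` pages starting at `buf`:
 * with nodes == NULL, move_pages() fills status[i] with the node that
 * holds page i, or a negative errno if the page is not present. */
static void report_page_nodes(void *buf, unsigned long count, long page_sz)
{
    void *pages[count];
    int status[count];
    for (unsigned long i = 0; i < count; i++)
        pages[i] = (char *)buf + i * page_sz;
    if (move_pages(0 /* current process */, count, pages, NULL, status, 0) == 0)
        for (unsigned long i = 0; i < count; i++)
            printf("page %lu -> node %d\n", i, status[i]);
}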

The Linux kernel since 2.6.12 supports the cpusets mechanism, which can migrate the memory of the applications confined to a cpuset, along with their threads, to new nodes if the parameters of the cpuset change. Schermerhorn et al. further extended the cpuset functionality by adding an automatic page migration mechanism [18]: if enabled, it migrates the memory of a thread within the cpuset nodes whenever the thread migrates to a core adjacent to a different node. Two options for memory migration are possible. The first is lazy migration, in which the kernel attempts to unmap any anonymous pages in the process's page table; when the process subsequently touches any of these unmapped pages, the swap fault handler uses the "migrate-on-fault" mechanism to move the misplaced pages to the correct node. Lazy migration may be disabled, in which case automigration uses direct, synchronous migration to pull all anonymous pages mapped by the process to the new node. The efficiency of lazy automigration is comparable to that of our IBS-based memory migration solution (we performed experiments to verify this). However, automigration requires kernel modification (it is implemented as a collection of kernel patches), while our solution is implemented at user level. The cpuset mechanism also needs explicit configuration by the system administrator, and it does not perform contention management.
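For reference, a minimal sketch of the baseline cpuset configuration described above, assuming the cgroup-v1 cpuset controller is mounted at /sys/fs/cgroup/cpuset; the cpuset name, CPU list, node number, and pid are illustrative.

#include <stdio.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (f) { fputs(val, f); fclose(f); }
}

int main(void)
{
    /* Create a cpuset confined to (hypothetically) the cores and memory of node 1. */
    mkdir("/sys/fs/cgroup/cpuset/dino", 0755);
    write_str("/sys/fs/cgroup/cpuset/dino/cpuset.cpus", "4-7");
    write_str("/sys/fs/cgroup/cpuset/dino/cpuset.mems", "1");
    /* With memory_migrate enabled, later changes to cpuset.mems move the
       tasks' pages along with the tasks. */
    write_str("/sys/fs/cgroup/cpuset/dino/cpuset.memory_migrate", "1");
    write_str("/sys/fs/cgroup/cpuset/dino/tasks", "12345");  /* hypothetical pid */
    return 0;
}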

In [19] the authors group threads of the same application that are likely to share data onto neighbouring cores, to minimize the cost of data sharing between them. They rely on several features of the performance monitoring unit unique to IBM OpenPower 720 machines: the ability to monitor the CPU stall breakdown charged to different microprocessor components, and the use of data sampling to track sharing patterns between threads. The DINO algorithm introduced in our work complements [19], as it is designed to mitigate contention between applications. DINO provides sharing support by attempting to group threads of the same application and their memory on the same NUMA node, but only as long as co-scheduling multiple threads of the same application does not contradict a contention-aware schedule. To develop a more precise metric that weighs the effects of performance degradation against the benefits of co-scheduling, we would need stronger hardware support, such as that available on IBM OpenPower 720 machines or on the newest Nehalem systems (as demonstrated by a member of our team [9]).

The VMware ESX hypervisor supports NUMA load balancing and automatic page migration for its virtual machines (VMs) in commercial systems [1]. ESX Server 2 assigns each virtual machine a home node: the VM is allowed to run on the home node's processors, and its newly allocated memory comes from the home node as well. Periodically, a special rebalancer module selects a VM and changes its home node to the least loaded node. In our work we do not consider load balancing; instead, we make thread migration decisions based on shared resource contention. To eliminate the remote access penalties associated with accessing memory on the old node, ESX Server 2 migrates pages from the virtual machine's original node to its new home node, selecting migration candidates by finding hot remotely-accessed memory from page faults. The DINO scheduler, on the other hand, identifies hot pages using Instruction-Based Sampling, and no modification to the OS is required.

The SGI Origin 2000 system [4] implemented a hardware-supported [13] mechanism for co-locating computation and memory. When the difference between remote and local access counts for a given memory page grows beyond a tunable threshold, an interrupt informs the operating system that the physical page is suffering an excessive number of remote references and hence has to be migrated. Our solution to page migration is different in that it detects "hot" remotely accessed pages via Instruction-Based Sampling, and performs migration in the context of a contention-aware scheduler.
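To illustrate the software side of such an IBS-based alternative, the sketch below counts sampled data addresses per page and nominates a page for migration once its count crosses a threshold. The IBS plumbing itself is omitted; the table size, the threshold, and the function name are our assumptions, not DINO's exact implementation.

#include <stdint.h>

#define NBUCKETS      4096   /* small fixed-size table, adequate for a sketch */
#define HOT_THRESHOLD 16     /* assumed tunable: samples before a page is "hot" */

struct bucket { uintptr_t page; unsigned count; };
static struct bucket table[NBUCKETS];

/* Feed one sampled data address (as delivered by the IBS handler);
 * returns the page address the first time it becomes hot, else 0. */
static uintptr_t note_sample(uintptr_t addr, long page_sz)
{
    uintptr_t page = addr & ~((uintptr_t)page_sz - 1);
    struct bucket *b = &table[(page / page_sz) % NBUCKETS];
    if (b->page != page) {               /* evict on collision: keep it simple */
        b->page = page;
        b->count = 0;
    }
    return (++b->count == HOT_THRESHOLD) ? page : 0;
}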

In a series of papers [7, 17, 6, 20] the authors describe Tornado, a novel operating system designed specifically for NUMA machines. The goal of this OS is to provide data locality and application independence for OS objects, thus minimizing penalties due to remote memory accesses in a NUMA system. The K42 project [11], which is based on Tornado, is an open-source research operating system kernel that incorporates innovative design principles such as structuring the system from modular, object-oriented code (originally demonstrated in Tornado), designing the system to scale to very large shared-memory multiprocessors, and avoiding centralized code paths, global locks, and global data structures, among others. K42 keeps physical memory close to where it is accessed, and it uses large pages to reduce the hardware and software costs of virtual memory. The K42 project has resulted in many important contributions to Linux, on which our work relies.


As a result, we were able to avoid the deleterious effects of remote memory accesses without requiring changes to the applications or the operating system. We believe that our NUMA contention-aware scheduling approach, which was demonstrated to work effectively in Linux, can also be easily implemented in K42, with its inherent user-level implementation of kernel functionality and native performance monitoring infrastructure.

7 Conclusions

We discovered that contention-aware algorithms designed for UMA systems may hurt performance on systems that are NUMA. We found that the key causes are contention for memory controllers and interconnects, which occurs when a thread runs remotely from its memory. To address this problem we presented DINO, a new contention management algorithm for NUMA systems. While designing DINO we found that simply migrating a thread's memory when the thread is moved to a new node is not a sufficient solution; it is also important to eliminate superfluous migrations: those that add to migration cost without providing the benefit. The goals for our future work are (1) devising a metric for predicting the trade-off between performance degradation and the benefits of thread sharing, and (2) investigating the impact of using small versus large memory pages during migration.

References

[1] VMware ESX Server 2 NUMA Support. White paper. [Online] Available: http://www.vmware.com/pdf/esx2_NUMA.pdf.
[2] BLAGODUROV, S., ZHURAVLEV, S., AND FEDOROVA, A. Contention-Aware Scheduling on Multicore Systems. ACM Trans. Comput. Syst. 28 (December 2010), 8:1-8:45.
[3] BRECHT, T. On the Importance of Parallel Application Placement in NUMA Multiprocessors. In USENIX SEDMS (1993).
[4] CORBALAN, J., MARTORELL, X., AND LABARTA, J. Evaluation of the Memory Page Migration Influence in the System Performance: the Case of the SGI O2000. In Proceedings of Supercomputing (2003), pp. 121-129.
[5] ERANIAN, S. What Can Performance Counters Do for Memory Subsystem Analysis? In Proceedings of MSPC (2008).
[6] GAMSA, B., KRIEGER, O., AND STUMM, M. Optimizing IPC Performance for Shared-Memory Multiprocessors. In Proceedings of ICPP (1994).
[7] GAMSA, B., KRIEGER, O., AND STUMM, M. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proceedings of OSDI (1999).
[8] GOGLIN, B., AND FURMENTO, N. Enabling High-Performance Memory Migration for Multithreaded Applications on Linux. In Proceedings of IPDPS (2009).
[9] KAMALI, A. Sharing Aware Scheduling on Multicore Systems. Master's thesis, Simon Fraser University, 2010.
[10] KNAUERHASE, R., BRETT, P., HOHLT, B., LI, T., AND HAHN, S. Using OS Observations to Improve Performance in Multicore Systems. IEEE Micro 28, 3 (2008), 54-66.
[11] KRIEGER, O., AUSLANDER, M., ROSENBURG, B., WISNIEWSKI, R. W., XENIDIS, J., DA SILVA, D., OSTROWSKI, M., APPAVOO, J., BUTRICO, M., MERGEN, M., WATERLAND, A., AND UHLIG, V. K42: Building a Complete Operating System. In Proceedings of EuroSys (2006).
[12] LAROWE, R. P., JR., ELLIS, C. S., AND HOLLIDAY, M. A. Evaluation of NUMA Memory Management Through Modeling and Measurements. IEEE Transactions on Parallel and Distributed Systems 3 (1991), 686-701.
[13] LAUDON, J., AND LENOSKI, D. The SGI Origin: a ccNUMA Highly Scalable Server. In Proceedings of ISCA (1997).
[14] LI, T., BAUMBERGER, D., KOUFATY, D. A., AND HAHN, S. Efficient Operating System Scheduling for Performance-Asymmetric Multi-core Architectures. In Proceedings of Supercomputing (2007).
[15] LUK, C.-K., COHN, R., MUTH, R., PATIL, H., KLAUSER, A., LOWNEY, G., WALLACE, S., REDDI, V. J., AND HAZELWOOD, K. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In Proceedings of PLDI (2005).
[16] MERKEL, A., STOESS, J., AND BELLOSA, F. Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors. In Proceedings of EuroSys (2010).
[17] PARSONS, E., GAMSA, B., KRIEGER, O., AND STUMM, M. (De-)Clustering Objects for Multiprocessor System Software. In Proceedings of IWOOS (1995).
[18] SCHERMERHORN, L. T. Automatic Page Migration for Linux.
[19] TAM, D., AZIMI, R., AND STUMM, M. Thread Clustering: Sharing-Aware Scheduling on SMP-CMP-SMT Multiprocessors. In Proceedings of EuroSys (2007).
[20] UNRAU, R. C., KRIEGER, O., GAMSA, B., AND STUMM, M. Hierarchical Clustering: a Structure for Scalable Multiprocessor Operating System Design. J. Supercomput. 9, 1-2 (1995), 105-134.
[21] XIE, Y., AND LOH, G. Dynamic Classification of Program Memory Behaviors in CMPs. In Proceedings of CMP-MSI (2008).
[22] ZHANG, E. Z., JIANG, Y., AND SHEN, X. Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? In Proceedings of PPoPP (2010).
[23] ZHURAVLEV, S., BLAGODUROV, S., AND FEDOROVA, A. Addressing Contention on Multicore Processors via Scheduling. In Proceedings of ASPLOS (2010).
