



OpenMP Task Scheduling Strategies for Multicore NUMA Systems

Stephen L. Olivier, University of North Carolina at Chapel Hill, Campus Box 3175, Chapel Hill, NC 27599, USA, [email protected]

Allan K. Porterfield, Renaissance Computing Institute (RENCI), 100 Europa Drive, Suite 540, Chapel Hill, NC 27517, USA, [email protected]

Kyle B. Wheeler, Department 1423: Scalable System Software, Sandia National Laboratories, Albuquerque, NM 87185, USA, [email protected]

Michael Spiegel, Renaissance Computing Institute (RENCI), 100 Europa Drive, Suite 540, Chapel Hill, NC 27517, USA, [email protected]

Jan F. Prins, University of North Carolina at Chapel Hill, Campus Box 3175, Chapel Hill, NC 27599, USA, [email protected]

ABSTRACT

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In order to evaluate scheduling strategies, we extended the open-source Qthreads threading library to implement our schedulers, accepting OpenMP programs through the ROSE compiler.

We present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on an Intel Nehalem multi-socket multicore system: our hierarchical work stealing scheduler, a fully-distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads fully-distributed scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations. Hierarchical scheduling in Qthreads is competitive on all benchmarks tested. On five of the seven benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run time systems. Our run time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.

Keywords

Task parallelism, Run time systems, Work stealing, Scheduling, Multicore, OpenMP

1. INTRODUCTION

Task parallel programming models offer a simple way for application programmers to specify parallel tasks in a problem-centric form that easily scales with problem size, leaving the scheduling of these tasks onto processors to be performed at run time. Task parallelism is well suited to the expression of nested parallelism in recursive divide-and-conquer algorithms and of unstructured parallelism in irregular computations.

An efficient task scheduler must meet challenging and sometimes conflicting goals: exploit cache and memory locality, maintain load balance, and minimize overhead costs. When there is an inequitable distribution of work among processors, load imbalance arises. Without redistribution of work, some processors become idle. Load balancing operations, when successful, redistribute the work more equitably across processors. However, load balancing operations can also contribute to overhead costs. Load balancing operations between sockets increase memory access time due to more cold cache misses and more high-latency remote memory accesses. This paper proposes an approach to mitigate these issues and advances understanding of their impact through the following contributions:

1. A hierarchical scheduling strategy targeting modern multi-socket multicore shared memory systems whose NUMA architecture is not well supported by either fully-distributed or centralized schedulers. Our approach combines work stealing and shared queues for low overhead load balancing and exploitation of shared caches.

2. A detailed performance study on a current generation multi-socket multicore Intel system. Seven run time implementations supporting task parallel OpenMP programs are compared: five schedulers that we added to the open-source Qthreads library, the GNU GCC OpenMP run time, and the Intel OpenMP run time. In addition to speedup results demonstrating superior performance by our run time on many of the diverse benchmarks tested, we examine several secondary metrics that illustrate the benefits of hierarchical scheduling over fully-distributed work stealing.

3. Additional performance evaluations on a two-socket multicore AMD system and a 192-processor SGI Altix. These evaluations demonstrate performance portability and scalability of our run time implementations.

This paper extends work originally presented in [26]. The remainder of the paper is organized as follows: Section 2 provides relevant background information, Section 3 describes existing task scheduler designs and our hierarchical approach, Section 4 presents the results of our experimental evaluation, and Section 5 discusses related work. We conclude in Section 6 with some final observations.

2. BACKGROUND

Broadly supported by both commercial and open-source compilers, OpenMP allows incremental parallelization of serial programs for execution on shared memory parallel computers. Version 3.0 of the OpenMP specification for FORTRAN and C/C++ adds explicit task parallelism to complement its existing data parallel constructs [27, 3]. The OpenMP task construct generates a task from a statement or structured block. Task synchronization is provided by the taskwait construct, and the semantics of the OpenMP barrier construct have also been overloaded to require completion of all outstanding tasks.
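As a concrete illustration of these constructs (our own minimal example, not code from the paper), a recursive computation can spawn one task per subproblem and use taskwait to join them:

#include <stdio.h>

/* Recursive Fibonacci with OpenMP 3.0 tasks: each recursive call becomes
 * a task, and taskwait blocks until both child tasks have completed. */
static long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x) firstprivate(n)
    x = fib(n - 1);
    #pragma omp task shared(y) firstprivate(n)
    y = fib(n - 2);
    #pragma omp taskwait      /* wait for the two child tasks */
    return x + y;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single    /* one thread generates the root of the task tree */
        result = fib(30);
    }                         /* the implicit barrier also waits for all outstanding tasks */
    printf("fib(30) = %ld\n", result);
    return 0;
}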

Execution of OpenMP programs combines the efforts of the compiler and an OpenMP run time library. Intel and GCC both have integrated OpenMP compiler and run time implementations. Using the ROSE compiler [24], we have created an equivalent method to compile and run OpenMP programs with the Qthreads [32] library. The ROSE compiler is a source-to-source translator that supports OpenMP 3.0 with a simple compiler flag. In one compile step, it produces an intermediate C++ file and invokes the GNU C++ compiler to compile that file with additional libraries to produce an executable. ROSE performs syntactic and semantic analysis on OpenMP directives, transforming them into run time library calls in the intermediate program. The ROSE common OpenMP run time library (XOMP) maps the run time calls to functions in the Qthreads library.

2.1 Qthreads

Qthreads [32] is a cross-platform general-purpose parallel run time library designed to support lightweight threading and synchronization in a flexible integrated locality framework. Qthreads directly supports programming with lightweight threads and a variety of synchronization methods, including non-blocking atomic operations and potentially blocking full/empty bit (FEB) operations. The Qthreads lightweight threading concept and its implementation are intended to match future hardware environments by providing efficient software support for massive multithreading.

In the Qthreads execution model, lightweight threads (qthreads) are created in user space with a small context and small fixed-size stack. Unlike heavyweight threads such as pthreads, qthreads do not support expensive features like per-thread identifiers, per-thread signal vectors, or preemptive multitasking. Qthreads are scheduled onto a small set of worker pthreads. Logically, a qthread is the smallest schedulable unit of work, such as a set of loop iterations or an OpenMP task, and a program execution generates many more qthreads than it has worker pthreads. Each worker pthread is pinned to a processor core and assigned to a locality domain, termed a shepherd. Whereas Qthreads previously allowed only one worker pthread per shepherd, we added support for multiple worker pthreads per shepherd. This support enables us to map shepherds to different architectural components, e.g., one shepherd per core, one shepherd per shared L3 cache, or one shepherd per processor socket.
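As a rough sketch of this execution model (assuming the qthread_initialize/qthread_fork/qthread_readFF interfaces and the qthread/qthread.h header as found in recent Qthreads releases; exact names and signatures may differ by version), a lightweight thread can be spawned and joined through an FEB-protected return slot:

#include <stdio.h>
#include <qthread/qthread.h>   /* assumed header name; adjust to the installed version */

/* Body of a lightweight thread (qthread): takes an argument pointer and
 * returns an aligned_t, which fills the FEB-protected return slot. */
static aligned_t hello(void *arg)
{
    printf("hello from a qthread, arg = %d\n", *(int *)arg);
    return 0;
}

int main(void)
{
    aligned_t ret, dummy;
    int arg = 42;

    qthread_initialize();             /* shepherds and worker pthreads are created here */
    qthread_fork(hello, &arg, &ret);  /* spawn; 'ret' starts empty */
    qthread_readFF(&dummy, &ret);     /* FEB read-when-full: blocks until hello() returns */
    return 0;
}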

The default scheduler in the Qthreads run time uses a cooperative multitasking approach. When qthreads block, e.g., performing an FEB operation, a context switch is triggered. Because this context switch is done in user space via function calls and requires neither signals nor saving a full set of registers, it is less expensive than an operating system or interrupt-based context switch. This technique allows qthreads to execute uninterrupted until data is needed that is not yet available, and allows the scheduler to attempt to hide communication latency by switching to other qthreads. Logically, this only hides communication latencies that take longer than a context switch.

The Qthreads API includes several threaded loop interfaces, built on top of the core threading components. The API provides three basic parallel loop behaviors: one that creates a separate qthread for each iteration, one that divides the iteration space evenly among all shepherds, and one that uses a queue-like structure to distribute sub-ranges of the iteration space to enable self-scheduled loops. We used the Qthreads queueing implementation as a starting point for our scheduling work.

We added support for the ROSE XOMP calls to Qthreads, allowing it to be used as the run time for OpenMP programs. Although Qthreads XOMP/OpenMP support is not fully complete, it accepts every OpenMP program accepted by ROSE. We implement OpenMP threads as worker pthreads. Unlike many OpenMP implementations, default loop scheduling is self-guided rather than static, though the latter can be explicitly requested. For task parallelism, we implement each OpenMP task as a qthread. (We use the term task rather than qthread throughout the remainder of the paper, both for simplicity and because the scheduling concepts we explore are applicable to other task parallel languages and libraries.) We used the Qthreads FEB synchronization mechanism as a base layer upon which to implement taskwait and barrier synchronization.
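The paper does not spell out the taskwait implementation; purely as a conceptual sketch (hypothetical names, with plain C11 atomics standing in for the FEB layer), the bookkeeping could be structured like this:

#include <stdatomic.h>

/* Hypothetical per-task-group bookkeeping: a parent counts its outstanding
 * child tasks and waits until they have all completed. */
typedef struct {
    atomic_int outstanding;   /* number of child tasks not yet finished */
} task_group_t;

static void task_group_init(task_group_t *g)      { atomic_init(&g->outstanding, 0); }
static void task_group_spawned(task_group_t *g)   { atomic_fetch_add(&g->outstanding, 1); }
static void task_group_completed(task_group_t *g) { atomic_fetch_sub(&g->outstanding, 1); }

/* taskwait: a real run time would execute other tasks or block on an
 * FEB-style primitive instead of spinning; the spin is only a placeholder. */
static void task_group_wait(task_group_t *g)
{
    while (atomic_load(&g->outstanding) != 0)
        ;   /* spin (stands in for a blocking wait or task execution) */
}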

3. TASK SCHEDULER DESIGN

The stock Qthreads scheduler, called Q in Section 4, was engineered for parallel loop computation. Each processor executes chunks of loop iterations packaged as qthreads. Round robin distribution of the iterations among the shepherds and self-scheduling are used in combination to maintain load balance. A simple lock-free per-shepherd FIFO queue stores iterations as they wait to be executed.

Task parallel programs generate a dynamically unfolding sequence of interdependent tasks, often represented by a directed acyclic graph (DAG). A task executing on the same thread as its parent or sibling tasks may benefit from temporal locality if they operate on the same data. In particular, such locality properties are a feature of divide-and-conquer algorithms. To efficiently schedule tasks as lightweight threads in Qthreads, the run time must support more general dynamic load balancing while exploiting available locality among tasks. We implemented a modified Qthreads scheduler, L, that uses LIFO rather than FIFO queues at each shepherd to improve the use of locality. However, the round robin distribution of tasks between shepherds does not provide fully dynamic load balancing.
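To make the per-shepherd LIFO discipline concrete, here is a minimal lock-based sketch of our own (not the Qthreads data structure): spawned tasks are pushed on top and the most recently pushed task is popped first, favoring parent/child and sibling locality.

#include <pthread.h>
#include <stddef.h>

typedef struct task {
    struct task *next;
    void (*run)(void *);
    void *arg;
} task_t;

/* One queue per shepherd: LIFO so that a just-created child (or sibling)
 * task, whose data is likely still warm in cache, is executed next. */
typedef struct {
    pthread_mutex_t lock;
    task_t *top;
} lifo_queue_t;

static void lifo_push(lifo_queue_t *q, task_t *t)
{
    pthread_mutex_lock(&q->lock);
    t->next = q->top;
    q->top = t;
    pthread_mutex_unlock(&q->lock);
}

static task_t *lifo_pop(lifo_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    task_t *t = q->top;
    if (t)
        q->top = t->next;
    pthread_mutex_unlock(&q->lock);
    return t;   /* NULL if the shepherd's queue is empty */
}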

3.1 Work Stealing & Centralized SchedulersTo better meet the dual goals of locality and load balance, we

implemented work stealing. Blumofe et al. proved that work steal-ing is optimal for multithreaded scheduling of DAGs with mini-mal overhead costs [7], and they implemented it in their Cilk runtime scheduler [6]. Our initial implementation of work stealing inQthreads, WS, mimics Cilk’s scheduling discipline: Each shepherdschedules tasks depth-first locally through LIFO queue operations.


Qthreads implementations (compiled with ROSE/GCC -O2 -g):

Name | Scheduler Implementation | Number of Shepherds | Task Placement | Internal Queue Access | External Queue Access
Q | Stock | one per core | round robin | FIFO (non-blocking) | none
L | LIFO | one per core | round robin | LIFO (blocking) | none
CQ | Centralized Queue | one | N/A | LIFO (blocking) | N/A
WS | Work Stealing | one per core | local | LIFO (blocking) | FIFO stealing
MTS | Multi-Threaded Shepherds | one per chip | local | LIFO (blocking) | FIFO stealing

Other implementations:

ICC | Intel 11.1 OpenMP, compiled -O2 -xHost -ipo -g
GCC | GCC 4.4.4 OpenMP, compiled -O2 -g

Table 1: Scheduler implementations evaluated: five Qthreads implementations, ICC, and GCC.

An idle shepherd obtains more work by stealing the oldest tasks from the task queue of a busy shepherd. We implemented two different probing schemes to find a victim shepherd, observing equivalent performance: choosing randomly and commencing the search at the nearest shepherd ID to the thief. In the work stealing scheduler, interruptions to busy shepherds are minimized because the burden of load balancing is placed on the idle shepherds. Locality is preserved because newer tasks, whose data is still hot in the processor's cache, are the first to be scheduled locally and the last in line to be stolen.
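A flat work-stealing idle loop in the style just described might be sketched as follows (our own illustration, reusing the task_t/lifo_queue_t types from the previous sketch and a hypothetical steal_oldest() helper that removes a task from the opposite, FIFO end of a victim's queue; victims are probed starting from the nearest shepherd ID):

/* Assumed from the previous sketch and from the run time proper. */
extern task_t *lifo_pop(lifo_queue_t *q);
extern task_t *steal_oldest(lifo_queue_t *victim);   /* hypothetical helper */
extern lifo_queue_t shepherd_queue[];                /* one queue per shepherd */
extern int num_shepherds;

/* Idle loop for shepherd 'self': run local work depth-first; when the
 * local queue empties, probe other shepherds, nearest ID first. */
static task_t *find_work(int self)
{
    task_t *t = lifo_pop(&shepherd_queue[self]);
    if (t)
        return t;
    for (int d = 1; d < num_shepherds; d++) {
        int victim = (self + d) % num_shepherds;
        t = steal_oldest(&shepherd_queue[victim]);
        if (t)
            return t;   /* oldest task: least likely to be hot in the victim's cache */
    }
    return NULL;        /* no work available anywhere right now */
}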

The cost of work stealing operations on multi-socket multicore systems varies significantly based on the relative locations of the thief and victim, e.g., whether they are running on cores on the same chip or on different chips. Stealing between cores on different chips reduces performance by incurring higher overhead costs, additional cold cache misses, remote locking, remote memory access costs, and coherence misses due to false sharing. Another limitation of work stealing is that it does not make the best possible use of caches shared among cores. In contrast, Chen et al. [12] showed that a depth-first schedule close to serial order makes better use of a shared cache than work stealing, assuming serial execution of an application makes good use of the cache. Blelloch et al. had shown that such a schedule can be achieved using a shared LIFO queue [5]. We implemented a centralized shared LIFO queue, CQ, for Qthreads, but it is a poor match for multi-socket multicore systems since not all cores, but only cores on the same chip, share the same cache. Moreover, the centralized queue implementation is not scalable, as contention drives up the overhead costs.

3.2 Hierarchical Scheduling

To overcome the limitations of both work stealing and shared queues, we developed a hierarchical approach: multithreaded shepherds, MTS. We create one shepherd for all the cores on the same chip. These cores share a cache, typically L3, and all are proximal to a local memory attached to that socket. Within each shepherd, we map one pthread worker to each core. Among workers in each shepherd, a shared LIFO queue provides depth-first scheduling close to serial order to exploit the shared cache. Thus, load balancing happens naturally among the workers on a chip, and concurrent tasks have possible overlapping localities that can be captured in the shared cache.

Between shepherds, work stealing is used to maintain load balance. Each time the shepherd's task queue becomes empty, only the first worker to find the queue empty steals enough tasks (if available) from another shepherd's queue to supply all the workers in its shepherd with work. The other workers in the shepherd spin until the stolen work appears. Centralized task queueing for workers within each shepherd reduces the need for remote stealing by providing local load balance. By allowing only one representative worker to steal at a time, in bulk for all workers in the shepherd, communication overheads are reduced. While a shared queue can be a performance bottleneck, the number of cores per chip is bounded, and intra-chip locking operations are fast.

Figure 2: Topology of the 4-socket Intel Nehalem.
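The bulk-stealing step can be sketched as follows (again our own simplification, not the Qthreads source): the first worker to find its shepherd's shared queue empty becomes the designated thief and tries to steal roughly one task per worker on its chip.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct task_queue task_queue_t;   /* shared LIFO queue, as in the earlier sketches */

typedef struct shepherd {
    task_queue_t *queue;       /* shared by all workers on this chip */
    atomic_flag   stealing;    /* set while one worker steals on behalf of the shepherd */
    int           num_workers; /* e.g., 8 on the Nehalem-EX test system */
} shepherd_t;

/* Hypothetical helper: move up to 'max' of the oldest tasks from victim to
 * dest, returning how many tasks were actually taken. */
extern int steal_chunk(task_queue_t *victim, task_queue_t *dest, int max);

/* Called by a worker that found its shepherd's queue empty. Returns true if
 * this worker performed the steal attempt; the other workers simply retry
 * popping until stolen work appears in the shared queue. */
static bool maybe_steal_for_shepherd(shepherd_t *sheps, int nsheps, int my_shep)
{
    shepherd_t *self = &sheps[my_shep];
    if (atomic_flag_test_and_set(&self->stealing))
        return false;                            /* another worker is already stealing */
    for (int d = 1; d < nsheps; d++) {
        shepherd_t *victim = &sheps[(my_shep + d) % nsheps];
        /* Take roughly one task per worker so every core can start immediately. */
        if (steal_chunk(victim->queue, self->queue, self->num_workers) > 0)
            break;
    }
    atomic_flag_clear(&self->stealing);
    return true;
}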

4. EVALUATION

To evaluate the performance of our hierarchical scheduler and the other Qthreads schedulers, we present results from the Barcelona OpenMP Tasks Suite (BOTS), version 1.1, available online [15]. The suite comprises a set of task parallel applications from various domains with varying computational characteristics [16]. Our experiments used the following benchmark components and inputs:

• Alignment: Aligns sequences of proteins using dynamic programming (100 sequences)

• Fib: Computes the nth Fibonacci number using brute-force recursion (n = 50)

• Health: Simulates a national health care system over a series of timesteps (144 cities)

• NQueens: Finds solutions of the n-queens problem using backtrack search (n = 14)

• Sort: Sorts a vector using parallel mergesort with sequential quicksort and insertion sort (128M integers)

• SparseLU: Computes the LU factorization of a sparse matrix (10000 × 10000 matrix, 100 × 100 submatrix blocks)

• Strassen: Computes a dense matrix multiply using Strassen's method (8192 × 8192 matrix)

For the Fib, Health, and NQueens benchmarks, the default manual cut-off configurations provided in BOTS are enabled to prune the generation of tasks below a prescribed point in the task hierarchy.


#pragma omp single
for (si = 0; si < nseqs; si++)
  for (sj = si+1; sj < nseqs; sj++)
    #pragma omp task firstprivate(si, sj)
    compare(seq[si], seq[sj]);

#pragma omp for schedule(dynamic)
for (si = 0; si < nseqs; si++)
  for (sj = si+1; sj < nseqs; sj++)
    #pragma omp task firstprivate(si, sj)
    compare(seq[si], seq[sj]);

Figure 1: Simplified code for the two versions of Alignment: single (top) and for (bottom).

Configuration | Alignment | Fib | Health | NQueens | Sort | SparseLU | Strassen
ICC -O2 -xHost -ipo, serial | 28.33 | 100.4 | 15.07 | 49.35 | 20.14 | 117.3 | 169.3
GCC -O2, serial | 28.06 | 83.46 | 15.31 | 45.24 | 19.83 | 119.7 | 162.7
ICC, 32 threads | 0.9110 | 4.036 | 1.670 | 1.793 | 1.230 | 7.901 | 10.13
GCC, 32 threads | 0.9973 | 5.283 | 7.460 | 1.766 | 1.204 | 4.517 | 10.13
Qthreads MTS, 32 workers | 1.024 | 3.189 | 1.122 | 1.591 | 1.080 | 4.530 | 10.72

Table 2: Sequential and parallel execution times using ICC, GCC, and the Qthreads MTS scheduler (time in sec.). For Alignment and SparseLU, the best time between the two parallel variations (single and for) is shown.

Figure 3: Health on 4-socket Intel Nehalem

For Sort, cutoffs are set to transition at 32K integers from parallel mergesort to sequential quicksort and from parallel merge tasks to sequential merge calls. For Strassen, the cut-off giving the best performance for each implementation is used. Other BOTS benchmarks are not presented here: UTS and FFT use very fine-grained tasks without cutoffs, yielding poor performance on all run times, and Floorplan raises compilation issues in ROSE.

For both the Alignment and SparseLU benchmarks, BOTS provides two different source files. Simplified code given in Figure 1 illustrates the distinction between the two versions of Alignment. In the first (Alignment-single), the loop nest that generates the tasks is executed sequentially by a single thread. This version creates only task parallelism. In the second (Alignment-for), the outer loop is executed in parallel, creating both loop-level parallelism and task parallelism. Likewise, the two versions of SparseLU are one in which tasks are generated within single-threaded loop executions and another in which tasks are generated within parallel loop executions.
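By analogy with Figure 1, the distinction for SparseLU can be sketched as follows (a simplified, hypothetical fragment, not the BOTS source: the other factorization kernels and the per-iteration synchronization are omitted, and A, nblocks, and bmod stand in for the benchmark's data and routines). In the single variant one thread generates every task; in the for variant the loop over block rows is a worksharing loop, so all threads generate tasks.

/* "single" variant: one thread generates every task */
#pragma omp single
for (int kk = 0; kk < nblocks; kk++)
    for (int ii = kk + 1; ii < nblocks; ii++)
        for (int jj = kk + 1; jj < nblocks; jj++) {
            #pragma omp task firstprivate(kk, ii, jj)
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
        }

/* "for" variant: the loop over block rows is a worksharing loop,
 * so every thread generates tasks for its share of the iterations */
for (int kk = 0; kk < nblocks; kk++) {
    #pragma omp for schedule(dynamic)
    for (int ii = kk + 1; ii < nblocks; ii++)
        for (int jj = kk + 1; jj < nblocks; jj++) {
            #pragma omp task firstprivate(kk, ii, jj)
            bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
        }
}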

Figure 4: Sort on 4-socket Intel Nehalem

We ran the battery of tests on seven scheduler implementations: five versions of Qthreads (all compiled with GCC 4.4.4 at -O2), the GNU GCC OpenMP implementation [17], and the Intel ICC OpenMP implementation, as summarized in Table 1. The Qthreads implementations are as follows.

• Q is the original version of Qthreads and defines each core to be a separate locality domain or shepherd. It uses a non-blocking FIFO queue to schedule tasks within each shepherd (individual core). Each shepherd only obtains tasks from its local queue, although tasks are distributed across shepherds on a round robin basis for load balance.

• L incorporates a simple double-ended locking LIFO queue to replace the original non-blocking FIFO queue. Concurrent access at both ends is required for work stealing, though L retains round robin task distribution for load balance rather than work stealing.

• CQ uses a single shepherd and centralized shared queue to distribute tasks among all of the cores in the system. This should provide adequate load balance, but contention for the queue limits scalability as task size shrinks.

• WS provides a shepherd (and individual queue) for each core, and idle shepherds steal tasks from the shepherds running on the other cores. Initial task placement is not round robin between queues, but onto the local queue of the shepherd where the task is generated, exploiting locality among related tasks.

• MTS assigns one shepherd to every processor memory locality (shared L3 cache on chip and attached DIMMs). Each core on a chip hosts a worker thread that shares its shepherd's queue. Only one core is allowed to actively steal tasks on behalf of the queue at a time, and tasks are stolen in chunks large enough (tunable) to keep all of the cores busy.


4.1 Overall Performance on Intel Nehalem

The first hardware test system for our experiments is a Dell PowerEdge M910 quad-socket blade with four Intel X7550 2.0GHz 8-core Nehalem-EX processors installed, for a total of 32 cores. The processors are fully connected using Intel QuickPath Interconnect (QPI) links, as shown in Figure 2. Each processor has an 18MB shared L3 cache, and each core has a private 256KB L2 cache as well as 32KB L1 data and instruction caches. The blade has 64 dual-rank 2GB DDR3 memory sticks (16 per processor chip) for a total of 132GB. It runs CentOS Linux with a 2.6.35 kernel. Although the X7550 processor supports HyperThreading (Intel's simultaneous multithreading technology), we pinned only one thread to each physical core for our experiments.

All executables using the Qthreads and GCC run times were compiled with GCC 4.4.4 with -g and -O2 optimization, for consistency. Executables using the Intel run time were compiled with ICC 11.1 and -O2 -xHost -ipo optimization. Reported results are from the best of ten runs.

Overall, the GCC compiler and ICC compiler produce executables with similar serial performance, as shown in Table 2. These serial execution times provide a basis for us to compare the relative speedup of the various benchmarks. If the -ipo and -xHost flags are not used with ICC on SparseLU, the GCC serial executable runs 3x faster than the ICC executable compiled with -O2 alone. The significance of this difference will be clearer in the presentation of parallel performance on SparseLU in Section 4.2. Several other benchmarks also run slower with those ICC flags omitted, though not by such a large margin.

Qthreads MTS 32-core performance is faster than or comparable to the performance of ICC and GCC. In absolute execution time, MTS runs faster than ICC for 5 of the 7 benchmarks, by up to 74.4%. It is over 6.6x faster than GCC for one benchmark and up to 65.6% faster on 4 of the 6 others. On two benchmarks MTS runs slower: for Alignment, it is 12.4% slower than ICC and 2.7% slower than GCC, and for Strassen it is 5.8% slower than both (although WS equaled GCC's performance; see the discussion of Strassen in Section 4.2). Thus even as a research prototype, ROSE/Qthreads provides competitive OpenMP task execution.

4.2 Individual Performance on Intel Nehalem

Individual benchmark performance on multiple implementations of the OpenMP run time demonstrates features of particular applications where Qthreads generates better scheduling and where it needs further development. Examining where the run times differ in achieved speedup reveals the strengths and weaknesses of each scheduling approach.

The Health benchmark, Figure 3, shows significant diversity in performance and speedup. GNU performance is slightly superlinear for 4 cores (4.5x), but peaks with only 8 cores active (6.3x), and by 32 cores the speedup is only 2x. Intel also has scaling issues, and performance flattens to 9x at 16 cores. Stock Qthreads Q scales slightly better (9.4x), but just switching to the LIFO queue L to improve locality between tasks allows speedup on 32 cores to reach 11.5x. Since the individual tasks are relatively small, CQ experiences contention on its task queue that limits speedup to 7.7x on 16 cores, with performance degrading to 6.1x at 32 cores. When work stealing, WS, is added to Qthreads, the performance improves slightly and speedup reaches 11.6x. MTS further improves locality and load balance on each processor by sharing a queue across the cores on each chip, and speedup increases to 13.6x on 32 cores. This additional scalability gives Qthreads MTS a 17.3% faster execution time on 32 cores than any other implementation, much faster than ICC (48.7%) and GCC (116.1%). Health provides an excellent example of how both work stealing and queue sharing within a system can independently and together improve performance, though the failure of any run time to reach 50% efficiency on 32 cores shows that there is room for improvement.

Figure 5: NQueens on 4-socket Intel Nehalem

Figure 6: Fib on 4-socket Intel Nehalem

Figure 7: Alignment-single on 4-socket Intel Nehalem

The benefits of hierarchical scheduling can also be seen in Figure 4. Sort, for which we used a manual cutoff of 32K integers to switch between parallel and serial sorts, achieved speedup of about 16x for 32 cores on ICC and GCC, but just 11.4x for the base version of Qthreads, Q. The switch to a LIFO queue, L, improved speedup to 13.6x by facilitating data sharing between a parent and child. Independent changes to add work stealing, WS, and improve load balance, CQ, both improved speedup to 16x. By combining the best features of both work stealing and multiple threads sharing a queue, MTS increased speedup to 18.4x and achieved a 13.8% and 11.4% reduction in overall execution time compared to the ICC and GCC OpenMP versions.

Figure 8: Alignment-for on 4-socket Intel Nehalem

Figure 9: SparseLU-single on 4-socket Intel Nehalem

Figure 10: SparseLU-for on 4-socket Intel Nehalem

Locality effects allow NQueens to achieve slightly super-linear speedup for 4 and 8 cores using Qthreads. As seen in Figure 5, speedup is near-linear for 16 threads and only somewhat sub-linear for 32 threads on all OpenMP implementations. By adding load balancing mechanisms to Qthreads, its speedup improved significantly (24.3x to 28.4x). CQ and WS both improved load balance beyond what the LIFO queue (L) provides, and little is gained by combining them together in MTS. The additional scaling of these three versions results in an execution time 12.6% faster than ICC and 10.9% faster than GCC.

Figure 11: Strassen on 4-socket Intel Nehalem

Fib, Figure 6, uses a cut-off to stop the creation of very small tasks, and thus has enough work in each task to amortize the costs of queue access. CQ yields performance 2-3% faster than MTS and the other versions of Qthreads, since load balance is good and no time is spent looking for work. The load balancing versions of Qthreads (26.1x - 26.7x) scale better than Intel at 24.9x. Both systems beat GCC substantially, at only 15.8x. Overall, the scheduling improvements resulted in MTS running 26.5% faster than ICC and 28.8% faster than GCC but 2.0% slower than CQ.

The next two applications, Alignment and SparseLU, each have two versions. For Alignment, Figures 7 and 8, speedup was near-linear for all versions and execution times between GCC and Qthreads were close (GCC +2.7% on the single initial task version; Qthreads +0.5% on the parallel loop version). ICC scales better than GCC or Qthreads MTS, WS, and CQ, with 12.4% lower execution time. Since Alignment has no taskwait synchronizations, we speculate that ICC scales better on this benchmark because it maintains fewer bookkeeping data structures in the absence of synchronization.

On both SparseLU versions, ICC serial performance improved nearly 3x using the -ipo and -xHost flags rather than using -O2 alone. The flags also improved parallel performance, but by only 60%, so the improvement does not scale linearly. On SparseLU-single, Figure 9, the performance of GCC and the various Qthreads versions is effectively equivalent, with speedup reaching 26.2x. Due to the aforementioned scaling issues, ICC speedup reaches only 14.8x. The execution times differ by 0.3% between GCC and MTS, with both about 74.4% faster than ICC. On SparseLU-for, Figure 10, the GCC OpenMP runs were stopped after exceeding the sequential time; thus data is not reported. ICC again scales poorly (14.8x), and Qthreads speedup improves due to the LIFO work queue and work stealing, reaching 22.2x. MTS execution time is 46.3% faster than ICC.

Strassen, Figure 11, performs recursive matrix multiplication using Strassen's method and is challenging for implementations with multiple workers accessing a queue. We used the cutoff setting that gave the best performance for each implementation: coarser (128) for CQ and MTS and the default setting (64) for the others. The execution times of GCC and WS are within 1% of each other on 32 cores, and Intel scales slightly better (16.7x vs 16.1x). For MTS, in which only 8 threads share a queue (rather than 32 as in CQ), the speedup reaches 15.2x. For CQ, however, the performance hit due to queue contention is substantial, as speedup peaks at 9.7x. Q performance suffers from the FIFO ordering: not enough parallel work is expressed at any one time, and speedup never exceeds 4x.


Configuration | Alignment (single) | Alignment (for) | Fib | Health | NQueens | Sort | SparseLU (single) | SparseLU (for) | Strassen
ICC, 32 threads | 4.4 | 2.0 | 3.7 | 2.0 | 3.2 | 4.0 | 1.1 | 3.9 | 1.8
GCC, 32 threads | 0.11 | 0.34 | 2.8 | 0.35 | 0.77 | 1.8 | 0.49 | N/A | 1.4
Qthreads MTS, 32 workers | 0.28 | 1.5 | 3.3 | 1.3 | 0.78 | 1.9 | 0.15 | 0.16 | 1.9
Qthreads WS, 32 shepherds | 0.035 | 1.8 | 2.0 | 0.29 | 0.60 | 0.90 | 0.060 | 0.24 | 3.0

Table 3: Variability in performance on 4-socket Intel Nehalem using ICC, GCC, MTS, and WS schedulers (standard deviation as a percent of the fastest time).

Benchmark | MTS Steals | MTS Failed | WS Steals | WS Failed
Alignment (single) | 1016 | 88 | 3695 | 255
Alignment (for) | 109 | 122 | 1431 | 286
Fib | 633 | 331 | 467 | 984
Health | 28948 | 10323 | 295637 | 47538
NQueens | 102 | 141 | 1428 | 389
Sort | 1134 | 404 | 19330 | 3283
SparseLU (single) | 18045 | 8133 | 68927 | 24506
SparseLU (for) | 13486 | 11889 | 68099 | 32205
Strassen | 227 | 157 | 14042 | 823

Table 5: Number of remote steal operations during benchmark execution by the Qthreads MTS and WS schedulers. In a failed steal, the thief acquires the lock on the victim's queue after a positive probe for work but ultimately finds no work available for stealing. On-chip steals performed by the WS scheduler are excluded. Average of ten runs.

Metric | MTS | WS | %Diff
L3 Misses | 1.16e+06 | 2.58e+06 | 38
Bytes from Memory | 8.23e+09 | 9.21e+09 | 5.6
Bytes on QPI | 2.63e+10 | 2.98e+10 | 6.2

Table 6: Memory performance data for Health using MTS and WS. Average of ten runs on 4-socket Intel Nehalem.

4.3 Variability

One interesting feature of a work stealing run time is an idle thread's ability to search for work and the effect this has on performance in regions of limited parallelism or load imbalance. Table 3 gives the standard deviation of 10 runs as a percent of the fastest time for each configuration tested with 32 threads. Both Qthreads implementations with work stealing (WS and MTS) have very small variation in execution time for 3 of the 9 programs. For 8 of the 9 benchmarks, both WS and MTS show less variability than ICC.

In three cases (Alignment-single, Health, SparseLU-single), Qthreads WS variability was much lower than MTS. Since MTS enables only one worker thread per shepherd at a time to steal a chunk of tasks, it is reasonable to expect this granularity to be reflected in execution time variations. Overall, we see less variability with WS than MTS in 6 of the 9 benchmarks. We speculate that normally having all the threads looking for work leads to finding the last work quickest and therefore less variation in total execution time. However, for some programs (Alignment-for, SparseLU-for, Strassen), stealing multiple tasks and moving them to an idle shepherd results in faster execution during periods of limited parallelism. WS also shows less variability than GCC in 6 of the 8 programs for which we have data. There is no data for SparseLU-for on GCC, as explained in the previous section.

Figure 12: Performance on Health using MTS based on choice of the chunk size for stealing. Average of ten runs on 4-socket Intel Nehalem.

Metric | MTS | WS | %Diff
L3 Misses | 1.03e+07 | 3.42e+07 | 54
Bytes from Memory | 2.27e+10 | 2.53e+10 | 5.5
Bytes on QPI | 4.35e+10 | 4.87e+10 | 5.6

Table 7: Memory performance data for Sort using MTS and WS. Average of ten runs on 4-socket Intel Nehalem.

4.4 Performance Analysis of MTS

Limiting the number of inter-chip load balancing operations is central to the design of our hierarchical scheduler (MTS). Consider the number of remote (off-chip) steal operations performed by MTS and by the flat work stealing scheduler WS, shown in Table 5. These counts exclude the number of on-chip steals performed by WS, and recall that MTS uses work stealing only between chips. We observe that WS steals more than MTS in almost all cases, in some cases by an order of magnitude. Health and Sort are two benchmarks where MTS wins clearly in terms of speedup. WS steals remotely over twice as many times as MTS on Sort and nearly twice as many times as MTS on Health. The number of failed steals is also significantly higher with WS than with MTS. A failed steal occurs when a thief's lock-free probe of a victim indicates that work is available, but upon acquisition of the lock to the victim's queue the thief finds no work to steal because another thread has stolen it or the victim has executed the tasks itself. Thus, both failed and completed steals contribute to overhead costs.

The MTS scheduler aggregates inter-chip load balancing by permitting only one worker at a time to initiate bulk stealing from remote shepherds. Figure 12 shows how this improves performance on Health, one of the benchmarks sensitive to load balancing granularity. If only one task is stolen at a time, subsequent steals are needed to provide all workers with tasks, adding to overhead costs. There are eight cores per socket on our test machine, thus eight workers per shepherd, and a target of eight tasks stolen per steal request.


Benchmark | Alignment (single) | Alignment (for) | Fib | Health | NQueens | Sort | SparseLU (single) | SparseLU (for) | Strassen
Tasks Stolen | 5900 | 450 | 2181 | 159386 | 423 | 5214 | 93117 | 38198 | 1355
Tasks Per Steal | 5.8 | 4.1 | 3.4 | 5.5 | 4.1 | 4.6 | 5.1 | 2.8 | 6.0

Table 4: Tasks stolen and tasks per steal using the MTS scheduler. Average of ten runs.

Figure 13: Topology of the 2-socket/4-chip AMD Magny Cours.

This coincides with the peak performance: when the target number of tasks stolen corresponds to the number of workers in the shepherd, all of the workers are able to draw work immediately from the queue as a result of the steal.

Frequently the number of tasks available to steal is less than the target number to be stolen. Table 4 shows the total number of tasks stolen and the average number of tasks stolen per steal operation. Across all benchmarks, the range of tasks stolen per steal is 2.8 to 6.0. The numbers skew downward due to a scarcity of work during start-up and near termination, when only one or a few tasks are available at a time. Note the lower number both of total steals and of tasks per steal for the for versions of Alignment and SparseLU compared to the single versions. Loop parallel initialization provides good initial load balance, so that fewer steals are needed, and those that do occur come sporadically near termination and synchronization phases.

Another benefit of the MTS scheduler is better L3 cache performance, since all workers in a shepherd share the on-chip L3 cache. The WS scheduler exhibits poorer cache performance and, subsequently, more reads to main memory. Tables 6 and 7 show the relevant metrics for Health and Sort as measured using hardware performance counters, averaged over ten runs. They also show more traffic on the Quick Path Interconnect (QPI) between chips for WS than for MTS. QPI traffic occurs when data is requested and transferred from either remote memory or remote L3 cache, i.e., attached to a different socket of the machine. Not only are remote accesses higher latency, but they also result in remote cache invalidations of shared cache lines and subsequent coherence misses. Increased QPI traffic in WS reflects more remote steals and more accesses to data in remote L3 caches and remote memory. In summary, MTS gains advantage by exploiting locality among tasks executed by threads on cores of the same chip, making good use of the shared L3 cache to access memory less frequently and avoid high latency remote accesses and coherence misses.

4.5 Performance on AMD Magny Cours

We also evaluate the Qthreads schedulers against ICC and GCC on a 2-socket AMD Magny Cours system, one node of a cluster at Sandia National Laboratories. Each socket hosts an Opteron 6136 multi-chip module: two quad-core chips that share a package, connected via two internal HyperTransport (HT) links. The remaining two HT links per chip are connected to the chips in the other socket, as shown in Figure 13. Each chip contains a memory controller with 8GB attached DDR3 memory, a 5MB shared L3 cache, and four 2.4GHz cores with 64KB L1 and 512KB L2 caches. Thus, there are a total of 16 cores and 32GB memory, evenly divided among four HyperTransport-connected NUMA nodes (one per chip, two per socket). The system runs the Cray compute-node Linux kernel 2.6.27, and we used GCC 4.6.0 with -O3 optimization and ICC 12.0 with -O3 -ipo -msse3 -simd optimization.

We ran the same benchmarks with the same parameters as in the Intel Nehalem evaluation. Sequential execution times are reported in Table 8. Again, interprocedural optimization (-ipo) in ICC was essential to match the GCC performance; execution time was more than 500 seconds without it. The greatest remaining difference between the sequential times is on Alignment, where GCC is 20% slower than ICC.

Speedup results using 16 threads are given in Figure 14, for Qthreads configurations with one shepherd per chip, MTS (4Q); one shepherd per socket, MTS (2Q); one shepherd per core (flat work stealing), WS; ICC; and GCC. At least one of the Qthreads variants matches or beats ICC and GCC on all but one of the benchmarks. Moreover, the Qthreads schedulers achieve near-linear to slightly super-linear speedup on 6 of the 9 benchmarks: the two versions of Alignment, Fib, NQueens, and the two versions of SparseLU. Of those, speedup using ICC is 22% and 23% lower than Qthreads on the two versions of Alignment, 10% and 18% lower on the two versions of SparseLU, 9% lower on NQueens, and 7% lower on Fib. GCC is 42% lower than Qthreads on Fib, 9% and 27% lower on the two versions of SparseLU, and close on NQueens and both versions of Alignment.

On three of the benchmarks, no run time achieves ideal speedup. Strassen is the only benchmark on which ICC and GCC outperform Qthreads, and even ICC falls short of 10x. On Sort, the best performance is with Qthreads WS, MTS (4Q), and GCC, all at roughly 8x. Speedup is lower with Qthreads MTS (2Q) and still lower with CQ, indicating that centralized queueing beyond the chip level is counterproductive. ICC speedup lags behind the other schedulers on this benchmark. Speedup on Health peaks at 3.3x on this system using the Qthreads schedulers, with even worse speedup using ICC and GCC.

The variability in execution times is shown in Table 9. The standard deviations for all of the benchmarks on the MTS and WS Qthreads implementations are below 2% of the best case execution time. On all but two of the benchmarks, the MTS standard deviation is less than 1%.

The Magny Cours results demonstrate that the competitive, and in some cases superior, performance of our Qthreads schedulers against ICC and GCC is not confined to the Intel architecture. At first glance, differences in performance using the various Qthreads configurations seem less pronounced than they were on the four-socket Intel machine. However, those differences were strongest on the Intel machine at 32 threads, and the AMD system only has 16 threads.


Configuration | Alignment | Fib | Health | NQueens | Sort | SparseLU | Strassen
ICC | 23.93 | 107.9 | 10.18 | 60.56 | 18.51 | 156.0 | 214.9
GCC | 29.77 | 105.0 | 10.67 | 58.16 | 17.72 | 153.4 | 211.1

Table 8: Sequential execution times using ICC and GCC on the AMD Magny Cours.

Figure 14: BOTS benchmarks on 2-socket AMD Magny Cours using 16 threads

Some architectural differences go beyond the difference in core count. MTS is designed to leverage locality in the shared L3 cache, but the Magny Cours has much less L3 cache per core than the Intel system (1.25MB/core versus 2.25MB/core). Less available cache also accounts for worse performance on the data-intensive Sort and Health benchmarks.

4.6 Performance on SGI Altix

We evaluate scalability beyond 32 threads on an SGI Altix 3700. Each of the 96 nodes contains two 1.6GHz Intel Itanium2 processors and 4GB of memory, for a total of 192 processors and 384GB of memory. The nodes are connected by the proprietary SGI NUMAlink 4 network and run a single system image of the SuSE Linux kernel 2.6.16. We used the GCC 4.5.2 compiler as the native compiler for our ROSE-transformed code and the GCC OpenMP run time for comparison against Qthreads. The version of ICC on the system is not recent enough to include support for OpenMP tasks. Sequential execution times, given in Table 10, are slower than those on the other machines because the Itanium2 is an older processor, runs at a lower clock rate, and uses a different instruction set (ia64).

The best observed performance on any of the benchmarks was on NQueens, shown in Figure 15. WS achieves 115x on 128 threads (90% parallel efficiency) and reaches 148x on 192 threads. MTS reaches 134x speedup. (On this machine, the MTS configuration has two threads per shepherd to match the two processors per NUMA node.) CQ tops out at 77x speedup on 96 threads, beyond which overheads from queue contention become overwhelming. GCC gets up to only 40x speedup. Although no run time achieves linear speedup on the full machine, they all reach 30x to 32x speedup with 32 threads; this underlines the importance of testing at higher processor counts to evaluate scalability. On the Fib benchmark, shown in Figure 16, MTS almost doubles the performance of CQ and GCC on 192 threads, with a maximum speedup of 97x. CQ peaks at 68x speedup on 128 threads, and WS exhibits its worst performance relative to MTS, maxing out at 77x speedup on 96 threads.

Figure 15: NQueens on SGI Altix

We see better peak performance on Alignment-for (Figure 18) than Alignment-single (Figure 17). WS reaches 116x speedup on 192 threads and MTS reaches 107x, with CQ and GCC performing significantly worse. On the other hand, SparseLU-single (Figure 19) scales better than SparseLU-for (Figure 20). Peak speedup on SparseLU-single is 89x with MTS and 86x with WS, while SparseLU-for achieves a peak speedup of 60x. As was the case on the 4-socket Intel machine, GCC is unable to complete after a timeout equal to the sequential execution time.

For three of the benchmarks, no improvement in speedup was observed beyond 32 threads: Health, Sort, and Strassen. As shown in Figure 21, none exceed 10x speedup on the Altix.


Configuration | Alignment (single) | Alignment (for) | Fib | Health | NQueens | Sort | SparseLU (single) | SparseLU (for) | Strassen
ICC | 2.2 | 0.80 | 1.3 | 14 | 1.1 | 8.2 | 0.62 | 0.31 | 2.5
GCC | 0.035 | 0.27 | 5.4 | 0.38 | 0.96 | 3.5 | 0.016 | 0.025 | 1.1
Qthreads MTS (4Q) | 0.25 | 0.63 | 1.5 | 0.17 | 0.13 | 1.1 | 0.012 | 0.16 | 0.98
Qthreads MTS (2Q) | 0.46 | 0.68 | 1.4 | 0.069 | 0.24 | 0.30 | 0.015 | 0.081 | 0.87
Qthreads WS | 0.21 | 1.3 | 1.5 | 0.15 | 0.13 | 1.8 | 0.036 | 0.094 | 1.4

Table 9: Variability in performance on AMD Magny Cours using 16 threads (standard deviation as a percent of the fastest time).

Configuration | Alignment | Fib | Health | NQueens | Sort | SparseLU | Strassen
GCC | 53.96 | 139.2 | 45.60 | 63.62 | 33.59 | 632.7 | 551.3

Table 10: Sequential execution times on the SGI Altix.

Figure 16: Fib on SGI Altix

Figure 17: Alignment-single on SGI Altix

These benchmarks were also observed to be the most challenging on the four-socket Intel and two-socket AMD systems. Health and Sort are the most data-intensive and require new strategies to achieve performance improvement, an important area of research going forward.

5. RELATED WORK

Many theoretical and practical issues of task parallel languages and their run time implementations were explored during the development of earlier task parallel programming models, both hardware supported, e.g., Tera MTA [1], and software supported, e.g., Cilk [6, 18]. Much of our practical reasoning was influenced by experience with the Tera MTA run time, designed for massive multithreading and low-overhead thread synchronization. Cilk scheduling uses a work-first scheduling strategy coupled with a randomized work stealing load balancing strategy shown to be optimal [7]. Our use of shared queues is inspired by Parallel Depth-First Scheduling (PDFS) [5], which attempts to maintain a schedule close to serial execution order, and by its constructive cache sharing benefits [12].

Figure 18: Alignment-for on SGI Altix

Figure 19: SparseLU-single on SGI Altix

The first prototype compiler and run time for OpenMP 3.0 tasks was an extension of Nanos Mercurium [30]. An evaluation of scheduling strategies for tasks using Nanos compared centralized breadth-first and fully-distributed depth-first work stealing schedulers [14]. Later extensions to Nanos included internal dynamic cut-off methods to limit overhead costs by inlining tasks [13].

In addition to OpenMP 3.0, several other task parallel languages and libraries are currently available to developers: the Microsoft Task Parallel Library [23] for Windows, Intel Threading Building Blocks (TBB) [22], and Intel Cilk Plus [21] (formerly Cilk++). The task parallel model and its run time support are also key components of the X10 [10] and Chapel [9] languages.

Figure 20: SparseLU-for on SGI Altix

Figure 21: Health, Sort, and Strassen on SGI Altix: 32 threads

Hierarchical work stealing, i.e., stealing at all levels of a hierarchical scheduler, has been implemented for clusters and grids in Satin [31], ATLAS [4], and more recently in Kaapi [28, 19]. Those libraries are not optimized for the shared caches found in multicore chips; exploiting such caches is the basis for the shared LIFO queue at the lower level of our hierarchical scheduler. The ForestGOMP run time system [8] also uses work stealing at both levels of its hierarchical scheduler but, like our system, targets NUMA shared memory systems. It schedules OpenMP nested data parallelism by clustering related threads (not tasks) into “bubbles,” scheduling them by work stealing among cores on the same chip, and selecting for work stealing between chips those threads with the lowest amount of associated memory. Data is migrated between sockets along with the stolen threads.

6. CONCLUSIONS AND FUTURE WORK

As multicore systems proliferate, the future of software development for supercomputing relies increasingly on high level programming models such as OpenMP for on-node parallelism. The recently added OpenMP constructs for task parallelism raise the level of abstraction to improve programmer productivity. However, if the run time cannot execute applications efficiently on the available multicore systems, the benefits will be lost.

The complexity of multicore architectures grows with each hardware generation. Today, even off-the-shelf server chips have 6-12 cores and a chip-wide shared cache. Tomorrow may bring 30+ cores and multiple caches that service subsets of cores. Existing scheduling approaches were developed based on a flat system model. Our performance study revealed their strengths and limitations on a current generation multi-socket multicore architecture and demonstrated that mirroring the hierarchical nature of the hardware in the run time scheduler can indeed improve performance.

Qthreads (by way of ROSE) accepts a large number of OpenMP 3.0 programs, and, using our MTS scheduler, has performance as high as or higher than the commonly available OpenMP 3.0 implementations. Its combination of shared LIFO queues and work stealing maintains good load balance while supporting effective cache performance and limiting overhead costs. On the other hand, pure work stealing has been shown to provide the least variability in performance, an important consideration for distributed applications in which barriers cause the application to run at the speed of the slowest worker, e.g., in a Bulk Synchronous Parallel (BSP) application with task parallelism used in the computation phase.
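As a rough sketch of that policy (illustrative C only, not the actual Qthreads MTS code; shepherd_t, lifo_pop(), and steal_half() are hypothetical helpers), each chip-level shepherd keeps one LIFO queue shared by its workers, and a single worker per chip attempts remote steals on behalf of all of them:

    /* Two-level scheduling sketch: shared per-chip LIFO queue plus
     * chip-to-chip stealing performed by one worker at a time per chip.
     * Termination detection is omitted. */
    #include <pthread.h>
    #include <stdbool.h>

    struct lifo;                                   /* opaque shared LIFO queue */
    typedef struct task task_t;
    typedef struct shepherd {
        struct lifo    *q;                         /* shared by cores on this chip */
        pthread_mutex_t steal_lock;                /* one remote steal per chip */
    } shepherd_t;

    extern task_t *lifo_pop(struct lifo *q);                        /* NULL if empty */
    extern bool    steal_half(shepherd_t *victim, shepherd_t *thief);

    task_t *next_task(shepherd_t *self, shepherd_t *sheps, int nsheps, int my_id)
    {
        task_t *t = lifo_pop(self->q);             /* local LIFO: cache-friendly */
        while (t == NULL) {
            if (pthread_mutex_trylock(&self->steal_lock) == 0) {
                for (int v = 0; v < nsheps && t == NULL; v++)
                    if (v != my_id && steal_half(&sheps[v], self))
                        t = lifo_pop(self->q);
                pthread_mutex_unlock(&self->steal_lock);
            }
            if (t == NULL)
                t = lifo_pop(self->q);             /* a sibling's steal may have refilled it */
        }
        return t;
    }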

The scalability results on the SGI Altix are important because previous BOTS evaluations [16, 26] only presented results on up to 32 cores. It is encouraging that several benchmarks reach speedups of 90X-150X on 192 cores. The challenge on those benchmarks is to close the performance gap between observed speedup and ideal speedup through incremental reductions in overhead costs and idle times and better exploitation of locality. Other benchmarks fail to scale well even at 32 threads or fewer. On the data-intensive Sort and Health benchmarks we have observed a sharp increase in computation time due to increased load latencies compared to sequential execution. To ameliorate that issue, we are investigating programmer annotations to specify task scheduling constraints that identify and maintain data locality.

One challenge posed by our hierarchical scheduling strategy is the need for an efficient queue supporting concurrent access on both ends, since workers within a shepherd share a queue. Most existing lock-free queues for work stealing, such as the Arora, Blumofe, and Plaxton (ABP) queue [2] and resizable variants [20, 11], allow only one thread to execute push() and pop() operations. Lock-free double-ended queues (deques) generalize the ABP queue to allow for concurrent insertion and removal on both ends of the queue. Lock-free deques have been implemented with compare-and-swap atomic primitives [25, 29], but speed is limited by their use of linked lists. We are currently working to implement an array-based lock-free deque, though even with a lock-based queue we have achieved results competitive with, and in many cases better than, ICC and GCC.
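For comparison, a minimal lock-based deque that does permit concurrent access at both ends can be written around a single mutex. The sketch below is illustrative only and is not the queue used in Qthreads; error handling is omitted.

    /* Mutex-protected doubly linked deque: any thread may push or pop at
     * either end. Only the tail push and the head pop used by thieves are
     * shown; the other two operations are symmetric. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct node {
        void        *task;
        struct node *prev, *next;
    } node_t;

    typedef struct {
        node_t         *head, *tail;
        pthread_mutex_t lock;
    } deque_t;

    void deque_push_tail(deque_t *d, void *task)
    {
        node_t *n = malloc(sizeof *n);             /* allocation check omitted */
        n->task = task;
        n->next = NULL;
        pthread_mutex_lock(&d->lock);
        n->prev = d->tail;
        if (d->tail) d->tail->next = n; else d->head = n;
        d->tail = n;
        pthread_mutex_unlock(&d->lock);
    }

    void *deque_pop_head(deque_t *d)               /* e.g., the steal end */
    {
        pthread_mutex_lock(&d->lock);
        node_t *n = d->head;
        void *task = NULL;
        if (n) {
            d->head = n->next;
            if (d->head) d->head->prev = NULL; else d->tail = NULL;
            task = n->task;
        }
        pthread_mutex_unlock(&d->lock);
        free(n);                                   /* free(NULL) is a no-op */
        return task;
    }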

7. ACKNOWLEDGMENTS

This work is supported in part by a grant from the United States Department of Defense. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

8. REFERENCES

[1] G. A. Alverson, R. Alverson, D. Callahan, B. Koblenz, A. Porterfield, and B. J. Smith. Exploiting heterogeneous parallelism on a multithreaded multiprocessor. In ICS ’92: Proc. 6th ACM Intl. Conference on Supercomputing, pages 188–197. ACM, 1992.

[2] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA ’98: Proc. 10th ACM Symposium on Parallel Algorithms and Architectures, pages 119–129. ACM, 1998.


[3] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst., 20:404–418, March 2009.

[4] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. Atlas: An infrastructure for global computing. In EW 7: Proc. 7th ACM SIGOPS European Workshop, pages 165–172, NY, NY, 1996. ACM.

[5] G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. JACM, 46(2):281–321, 1999.

[6] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In PPoPP ’95: Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216. ACM, 1995.

[7] R. Blumofe and C. Leiserson. Scheduling multithreaded computations by work stealing. In SFCS ’94: Proc. 35th Annual Symposium on Foundations of Computer Science, pages 356–368. IEEE, Nov. 1994.

[8] F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P.-A. Wacrenier, and R. Namyst. Structuring the execution of OpenMP applications for multicore architectures. In IPDPS 2010: Proc. 25th IEEE Intl. Parallel and Distributed Processing Symposium. IEEE, April 2010.

[9] B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the Chapel language. IJHPCA, 21(3):291–312, 2007.

[10] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In OOPSLA ’05: Proc. 20th ACM SIGPLAN Conference on Object Oriented Programming Systems, Languages, and Applications, pages 519–538, NY, NY, 2005. ACM.

[11] D. Chase and Y. Lev. Dynamic circular work-stealing deque. In SPAA ’05: Proc. 17th ACM Symposium on Parallelism in Algorithms and Architectures, pages 21–28. ACM, 2005.

[12] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In SPAA ’07: Proc. 19th ACM Symposium on Parallel Algorithms and Architectures, pages 105–115. ACM, 2007.

[13] A. Duran, J. Corbalán, and E. Ayguadé. An adaptive cut-off for task parallelism. In SC08: ACM/IEEE Supercomputing 2008, pages 1–11, Piscataway, NJ, 2008. IEEE Press.

[14] A. Duran, J. Corbalán, and E. Ayguadé. Evaluation of OpenMP task scheduling strategies. In R. Eigenmann and B. R. de Supinski, editors, IWOMP ’08: Proc. Intl. Workshop on OpenMP, volume 5004 of LNCS, pages 100–110. Springer, 2008.

[15] A. Duran and X. Teruel. Barcelona OpenMP Tasks Suite. http://nanos.ac.upc.edu/projects/bots, 2010.

[16] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In ICPP ’09: Proc. 38th Intl. Conference on Parallel Processing, pages 124–131. IEEE, Sept. 2009.

[17] Free Software Foundation Inc. GNU Compiler Collection. http://www.gnu.org/software/gcc/, 2010.

[18] M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI ’98: Proc. 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 212–223. ACM, 1998.

[19] T. Gautier, X. Besseron, and L. Pigeon. Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In PASCO ’07: Proc. 2007 Intl. Workshop on Parallel Symbolic Computation, pages 15–23. ACM, 2007.

[20] D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Distributed Computing, 18:189–207, 2006.

[21] Intel Corp. Intel Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/, 2010.

[22] A. Kukanov and M. Voss. The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technology Journal, 11(4), Nov. 2007.

[23] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. SIGPLAN Notices: OOPSLA ’09: 24th ACM SIGPLAN Conference on Object Oriented Programming Systems, Languages, and Applications, 44(10):227–242, 2009.

[24] C. Liao, D. J. Quinlan, T. Panas, and B. R. de Supinski. A ROSE-based OpenMP 3.0 research compiler supporting multiple runtime libraries. In M. Sato, T. Hanawa, M. S. Müller, B. M. Chapman, and B. R. de Supinski, editors, IWOMP 2010: Proc. 6th Intl. Workshop on OpenMP, volume 6132 of LNCS, pages 15–28. Springer, 2010.

[25] M. M. Michael. CAS-based lock-free algorithm for shared deques. In H. Kosch, L. Böszörményi, and H. Hellwagner, editors, Euro-Par 2003: Proc. 9th Euro-Par Conference on Parallel Processing, volume 2790 of LNCS, pages 651–660. Springer, 2003.

[26] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins. Scheduling task parallelism on multi-socket multicore systems. In ROSS ’11: Proc. Intl. Workshop on Runtime and Operating Systems for Supercomputers (in conjunction with 2011 ACM Intl. Conference on Supercomputing), pages 49–56. ACM, 2011.

[27] OpenMP Architecture Review Board. OpenMP API, Version 3.0, May 2008.

[28] J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In EuroPar ’10: Proc. 16th Intl. Euro-Par Conference on Parallel Processing: Part I, pages 217–229, Berlin, Heidelberg, 2010. Springer.

[29] H. Sundell and P. Tsigas. Lock-free and practical doubly linked list-based deques using single-word compare-and-swap. In T. Higashino, editor, OPODIS 2004: 8th Intl. Conference on Principles of Distributed Systems, volume 3544 of LNCS, pages 240–255. Springer, 2005.

[30] X. Teruel, X. Martorell, A. Duran, R. Ferrer, and E. Ayguadé. Support for OpenMP tasks in Nanos v4. In K. A. Lyons and C. Couturier, editors, CASCON ’07: Proc. 2007 Conference of the Center for Advanced Studies on Collaborative Research, pages 256–259. IBM, 2007.

[31] R. van Nieuwpoort, T. Kielmann, and H. E. Bal. Satin: Efficient parallel divide-and-conquer in Java. In Euro-Par ’00: Proc. 6th Intl. Euro-Par Conference on Parallel Processing, pages 690–699, London, UK, 2000. Springer.

[32] K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In IPDPS 2008: Proc. 22nd IEEE Intl. Symposium on Parallel and Distributed Processing, pages 1–8. IEEE, 2008.