
An Early Prototype of an Autonomic Performance Environment for Exascale

Kevin Huck, Sameer Shende, Allen Malony
University of Oregon
5294 University of Oregon, Eugene, Oregon 97403
{khuck,sameer,malony}@cs.uoregon.edu

Hartmut Kaiser
Center for Computation & Technology, Louisiana State University
216 Johnston Hall, Baton Rouge, LA 70803
[email protected]

Allan Porterfield, Rob Fowler
RENCI
100 Europa Dr., Chapel Hill, NC 27517
{akp,rjf}@renci.org

Ron Brightwell
Sandia National Labs
PO Box 5800, Albuquerque, NM 87185
[email protected]

ABSTRACT

Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first- and third-person observation, in situ analysis, and introspection across stack layers that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology, and results are presented that highlight the challenges of highly integrative observation and runtime analysis.

1. INTRODUCTION

The challenges to achieving extreme-scale computing for DOE mission-critical applications demand basic research in future system software and programming models if today's strong-scaled problems are to derive performance advantage from Moore's Law and a broader range of problems is to fully exploit Exascale computing capabilities by the end of this decade. We are in the first year of a three-year

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ROSS '13, June 10, 2013, Eugene, Oregon, USA
Copyright 2013 ACM 978-1-4503-2146-4/13/06 ...$15.00.

DOE-funded X-Stack project, eXascale PRogramming Environment and System Software (XPRESS), to investigate programming methods, runtime software, system services, and tools needed to create a proof-of-concept system software stack for Exascale, focused on strong-scaled computing for real-world applications. The XPRESS software development will be enabled and driven by innovative concepts in future extreme-scale computing and research to explore and evaluate them. These two project thrusts – concepts research and software stack development – will be coordinated and mutually supportive. XPRESS will engage the talents and contributions of Indiana University, Louisiana State University, University of Houston, University of Delaware, Oak Ridge National Laboratory, University of Oregon, and University of North Carolina at Chapel Hill to provide comprehensive coverage of the interrelated domains, disciplines, and advances necessary to inform and facilitate the research in Exascale computing, applications, systems, and programming.

What makes XPRESS unique as a project is that it will deliver a new and complete system software architecture based on the innovative ParalleX execution model [17]. ParalleX has been devised to address challenges of starvation, overhead, latency, and contention by enabling a new computing dynamic through the application of message-driven computation in a global address space context with lightweight synchronization. XPRESS will design an integrated Exascale software stack for ParalleX, OpenX, and implement a critical proof-of-concept prototype to validate its concepts for achieving high levels of parallelism and resource efficiency.

One of the key components of the XPRESS project is a new approach to performance observation, measurement, analysis, and runtime decision making in order to optimize performance. The particular challenges of accurately measuring the performance characteristics of ParalleX applications require a new approach to parallel performance observation. The standard model of multiple operating system processes and threads observing themselves in a first-person manner while writing out performance profiles or traces for offline analysis will not adequately capture the full execution context, nor provide opportunities for runtime adaptation within OpenX. The approach taken in the XPRESS project is a new performance measurement system, called APEX (Autonomic Performance Environment for eXascale). APEX will include methods for information sharing between the layers of the software stack, from the hardware through operating and runtime systems, all the way to domain specific or legacy applications. The performance measurement components will incorporate relevant information across stack layers, with merging of third-person performance observation of node-level and global resources, remote processes, and both operating and runtime system threads.

We have taken the first steps at realizing APEX with the current OpenX runtime system, HPX, the TAU Performance System, and the RCRToolkit. Results of these initial efforts are reported below. First, an overview of the XPRESS project is given, followed by a description of OpenX components. We then discuss our APEX prototype implementation. Our objective is to learn from experiments with the software about the issues involved and how the relevant software components need to evolve to address them. The prototype is destined to be discarded, but the lessons learned will directly inform the final APEX version.

2. XPRESS OVERVIEW

2.1 Vision

The XPRESS project envisions a class of systems, both software and hardware, that reflects the new realities of emerging enabling component technologies and responds to the challenges they impose in terms of starvation, latency, overheads, and contention, as well as the availability constraints of reliability, power consumption, and application user programmability. Innovative concepts and their synthesis are imperative if sustained exaflops performance is to be realized and applied for DOE mission-critical requirements. Future extreme-scale computational systems will move from the previous static, largely compiler- and user-based resource management to future dynamic, runtime-based system and work supervision, with an entirely new form of lightweight kernel operating system exhibiting a single system image but comprising an ensemble of potentially millions of operating system agents operating in synergy and through a revolutionary relationship with runtime control.

For the first time, the tension between machine responsibilities and application requirements will be balanced by a system incorporating this new dynamic runtime versus operating system relationship. The logical point of exchange is the Prime Medium, a new system-wide interface protocol that supports bi-directional information exchange, including local and global state and the mutual requirements of each layer. Instrumentation measurements, faults, power consumption, utilization and demand, changes in requirements, locality attributes, and parallelism discovery are all aspects of the new vision of computing that will be enabled by the symbiotic relationship to be established between the runtime and operating software systems. But the vision must and does extend further, to the programming models and methods that must provide the end user with abstractions and means of eliciting 10,000 times more parallelism than conventional practices

Figure 1: Major components of the OpenX architecture stack. (The figure shows legacy and new-model applications, MPI and OpenMP atop a metaprogramming framework and domain specific active libraries, the compiler and XPI layer, runtime system instances with AGAS, LCOs, lightweight threads, and parcels, the Prime Medium interface/control layer, operating system instances of the distributed framework operating system, and a hardware architecture of 10^6 nodes with 10^3 cores per node plus an integration network.)

and the ability to exploit runtime information and instrumentation for resource usage. New APIs, both low-level and domain specific, will be required and provided to work in the context of the new class of Exascale systems, even as means of legacy mitigation allow old applications to run on new hardware/software system platforms. Among the interstices of these concepts and the system modules that manifest them is key support and new functionality to be attributed to and supported by advanced compilation methods.

2.2 XPRESS Project

The XPRESS project comprises four major thrusts: system software, programming models and languages, applications, and crosscutting issues. Efforts in these areas will be defined and guided by key metrics.

2.2.1 Exascale System Software

The HPX-3 runtime system provides an early starting point as a programming tools and operating system target at the beginning of the XPRESS project. It will be phased out as HPX-4 – based on a new modular software architecture, OpenX – is developed, incorporating added functionality for fault tolerance and power management to provide a robust open-source runtime system. The LXK lightweight kernel operating system will be developed based on the advanced Kitten operating system [25, 5] in response to the new requirements for billion-way concurrency, introspective management of faults and power, assumed future directions of system architectures while dealing with near-term systems, and management of a protected and dynamic global virtual name space. LXK will be co-designed with HPX-4 around the centerpiece of the Prime Medium interface between the runtime and operating system software. This interface will share information in both directions between the two major software layers for performance, reliability, and control of power consumption. The OpenX software stack is shown in Figure 1.

2.2.2 Programming Models and Languages

Two programming methods will be employed to provide early means of conducting application kernel driven experiments and to facilitate ease of programming and portability. A low-level imperative programming interface, XPI, will be developed to expose the semantic constructs comprising the ParalleX execution model embodied in the experimental HPX runtime system. To facilitate the creation of diverse domain specific programming libraries, the XPRESS project will develop a set of tools, a meta-programming framework, that will support rapid deployment of future embedded DSL formalisms exploiting the potential of ParalleX-based execution environments. The project will also develop an example of a DSL using the meta-DSL toolkit. The project will explore in depth and provide means of legacy mitigation to ensure seamless transition of legacy codes embodied in the MPI or OpenMP programming interfaces to the OpenX software stack. The project will develop and demonstrate interfaces to XPI for legacy code execution, provide for interoperability between software modules in both forms, and provide means for extending parallelism incrementally within the MPI and OpenMP frameworks to increase scalability through the XPRESS OpenX software stack. APEX will provide instrumentation methods for performance measurement to XPI, DSL, and legacy codes.

2.2.3 Applications

Three computational science domains are included: climate science, nuclear energy, and plasma physics, represented by the Community Earth System Model (CESM) [30], Denovo [3], and GTC [19, 20, 7], respectively. GTC is a small code already being redesigned for ParalleX. CESM and Denovo are both large, complex applications that address important science and engineering questions with stringent numerical requirements, numerous configuration and science options, a need to retain the ability to run on modest computational resources, and a need to run on Exascale systems. CESM and Denovo will be used to verify the ability to transition smoothly from the current MPI+OpenMP programming model to the new XPRESS runtime and programming models. A fourth driver is the collection of libraries in Trilinos [11], which serve the needs of hundreds of applications within DOE and across the world.

2.2.4 Crosscutting Issues

Essential cross-cutting functionality includes automatic control and introspection, resilience, power management, and heterogeneity. Power management software, in combination with anticipated energy-efficient hardware, will achieve much greater resource utilization per joule while dramatically reducing data movement, a major source of power consumption, through active locality management. Resilience will be achieved through a combination of in-memory micro-checkpointing, which temporarily stores selected past data in memory in case of failures until it is no longer required for future calculations, with a new compute-validate-commit cycle that defers side-effects until previous calculations have been

Figure 2: Modular structure of the HPX-3 implementation. HPX-3 implements the supporting functionality for all of the elements needed for the ParalleX model: AGAS (active global address space), parcel port and parcel handlers, HPX-threads and thread manager, ParalleX processes, LCOs (local control objects), performance counters enabling dynamic and intrinsic system and load estimates, and the means of integrating application specific components.

determined correct, thus isolating error propagation.

3. RELEVANT COMPONENTS

3.1 HPX

High Performance ParalleX (HPX [17, 16, 28, 2]) is the first open-source implementation of the ParalleX execution model. HPX is a modular, state-of-the-art runtime system developed for conventional architectures and, currently, Linux-based systems, such as large Non-Uniform Memory Access (NUMA) machines and clusters. Strict adherence to standard C++11 [29] and the utilization of the Boost C++ Libraries [4] make HPX both portable and highly optimized. Its modular framework facilitates simple compile-time or runtime configuration and minimizes the runtime footprint. The current implementation of HPX supports all of the key ParalleX paradigms: parcels and a parcel transport layer for inter-locality communication, PX-threads and their management, Local Control Objects (LCOs) for lightweight synchronization of medium-grain parallelism, the Active Global Address Space (AGAS) as the basis for efficient resource management, and PX-processes to represent and manage the dynamic execution structure of an application at runtime.

The existing HPX-3 library (shown in Figure 2) supplies a valuable initial environment for the other parts of this project by providing immediate availability of stable programming interfaces and a functional, ParalleX-compliant low-level runtime environment. Moreover, the experience collected during several years of HPX development is highly relevant to the design, implementation, and optimization of all related XPRESS software layers, including the lightweight kernel, the distributed operating system, the improved next-generation runtime system HPX-4, and the development of a migration path for legacy applications. The latter will take full advantage of the existing code by pairing it with interfaces that (a) permit direct access to core HPX functionality from traditional C and Fortran programs via a lightweight programming interface (XPI) for advanced programmers who wish to explore it, and (b) enable traditional MPI and OpenMP functions by providing a thin translation layer for MPI combined with compiler support for OpenMP operation.

3.2 APEX

Scaling systems to extreme-size capabilities is happening through three architectural trends. First, the number of computational cores on each chip and node will continue to increase. Second, the number of nodes will continue to grow. Third, systems will become more heterogeneous at levels including accelerator modules (GPUs, MIC [14]), shared functional units on chip (IBM BlueGene and AMD Bulldozer), and performance-heterogeneous multi-core chips based on a single instruction set architecture [18, 21]. The performance, energy, and health measurement and control infrastructure of XPRESS must address all of these.

Performance is no longer constrained primarily by the throughput of individual cores or the behavior of the code executing in a single pipeline. Increasingly, cores interact with each other through shared resources. These interactions are manifested by bottlenecks and queuing at scarce resources both on a chip (node) and between nodes. The system measurement and control components of XPRESS need to address all of these areas. Furthermore, to be useful for introspective adaptive control, both measurements and analyses need to be available in real time to software components at all levels.

Over the last ten years, high-performance computing (HPC) performance methods have evolved incrementally to serve the dominant architectures and programming models. The parallel performance abstractions embodied in HPC tools such as TAU [26], HPCToolkit [1], Open|SpeedShop [23], oprofile [15], and PAPI [6] are driven by the underlying system environment. Performance observation requirements have been satisfied by local measurements and offline performance optimization. The stable, static model has allowed performance tools to focus on a first-person measurement model where performance is seen only by individual threads and collected at the end of execution for post-mortem analysis. The first-person approach is inadequate for Exascale use since the shared resources operate independently of the cores, with their own hardware monitors built in [13]. This requires using a system-wide third-person measurement model. A key component of XPRESS will be a measurement and analysis facility designed specifically to support introspective adaptation for performance, energy, and reliability by making node-wide resource utilization data and analysis, energy consumption, and health information available in real time.

The entire Exascale software stack needs to be performance-aware and performance-reactive, able to observe performance state holistically and to couple it with application knowledge for self-adaptive, runtime control. We use the term autonomic performance to reflect this idea. XPRESS will be designed and developed with an autonomic performance environment for Exascale (called APEX) that embodies this new performance paradigm.

The role of APEX in the XPRESS project will be to serve both top-down and bottom-up performance requirements of the OpenX software stack and its integrated operation. The top-down requirements are driven by the mapping of applications to the ParalleX model and the translation through the programming models and the language compilers into runtime operations and execution. The mapping and translation process at each level includes performance abstractions. Defining the set of parameters to be observed and then coupling the abstractions to the observables is the function of the hierarchical performance framework. The top-down view sees APEX functionality as part of application creation, specifically to provide integrated performance assessment and tuning. APEX observability support will be built into each OpenX layer:

• The LXK operating system will track system resource assignment, utilization, job contention, and overhead.

• HPX will track threads, queues, concurrency, remote operations, parcels, and memory management.

• ParalleX, DSLs, and legacy codes will allow language-level performance semantics to be measured.

The APEX observation infrastructure will be specialized through programming to enable autonomic performance capabilities specific to the application. It is also important to map high-level semantic context top-down to low-level observation, where context can be used to distinguish between performance artifacts.

3.2.1 Bottom-up Development Approach

The bottom-up requirements of APEX are targeted at performance introspection across the OpenX layers to enable dynamic, adaptive operation and decision control. The working model here is multi-parameter system optimization. APEX creates the performance feedback mechanisms and builds an efficient infrastructure for connecting subscribers to runtime performance state. There are intra-level performance awareness needs for HPX and LXK, but these will clearly interplay with the overall application dynamics. The performance information provided for introspection at feedback points will be the result of runtime analysis. The design of APEX to meet both top-down and bottom-up requirements will enable closed-loop performance optimization for the ParalleX execution model. APEX development will be constrained by available Exascale technology. Important factors concerning the overhead of measurement, the cost of analysis, and the latency of feedback will contribute to APEX's effectiveness. Our goal is to support the general concept of performance portability through dynamic adaptivity.

The bottom-up approach for XPRESS will extend previous experimental work [8, 22, 24] on building decision support instrumentation (RCRToolkit) for introspective adaptive scheduling on current-generation x86_64 Linux systems. In this approach, called Resource Centric Reflection, a system daemon (RCRdaemon) monitors shared, non-core resources, applies real-time analyses, and publishes both raw and processed information for use by other software components. The components include a display and logging tool (RCRtool) and an adaptive thread scheduler. Previously [8], performance introspection was used to prevent performance degradation at very high loads in a commercial database system. In more recent work, the MAESTRO scheduler throttles thread/core concurrency to successfully reduce power consumption without degrading performance in several cases, and in other cases has increased throughput. HPCToolkit [1] was modified to use RCRToolkit information to distinguish cache misses that occur when there is a memory bandwidth bottleneck from those that occur when memory is lightly utilized. Since memory channels are monitored independently, RCRToolkit has been used to detect and diagnose memory imbalance problems.

Within a “node”, software components will communicate through a self-describing blackboard structure. This is a shared memory region organized as a hierarchy of single-writer, multiple-reader data regions aligned along cache block and virtual memory page boundaries for performance and access control purposes. The single-writer, multiple-reader organization allows lock-free access and minimizes memory coherence overhead. The blackboard is self-describing, with a directory structure to locate regions owned by individual software modules as well as the data layout within each region. The blackboard is a key component of the Prime Medium communication between the operating and runtime systems.

Good individual node performance does not automatically result in good performance for hundreds of thousands of nodes. Understanding and adapting application performance across the entire system will require tools that communicate with all of the single-node blackboards to generate a global view of performance.

Conventional techniques used for off-line performance analysis, such as profiling and tracing, can generate a volume of data that will both impact application performance and require substantial computing power. To address this, we will build on past work with hierarchical lossy wavelet compression to filter out uninteresting parts of the performance signals while preserving outliers and anomalous patterns [10]. We have also used clustering [9] to further reduce data requirements for off-line analysis. These methods will extend the blackboard mechanism hierarchically to each level at which control decisions are made.

3.2.2 Top-Down Development Approach

In the APEX model, performance observability requirements and semantics, starting with the application, are specified through the programming model and languages and implemented by instrumentation generated by the compiler that invokes an HPX performance measurement API. It couples the first-person and third-person performance views by translating application context down the OpenX stack in order to associate the execution performance state back up.

APEX will integrate TAU instrumentation capabilities with the XPRESS programmatic specialization methodology and its compiler framework, the legacy programming tools, and the imperative API for ParalleX-based system programming. TAU's mapping API will be used to create higher-level abstractions to contextualize lower-level measurements. Both static and dynamic wrapper interposition libraries will intercept and instrument the imperative API and the runtime layers of HPX. TAU's multi-level instrumentation API will be augmented with a performance feedback API that will supply vital runtime information to a compiler for further analysis.

4. APEX PROTOTYPE IMPLEMENTATION

In the first year of the ongoing XPRESS project, the project team has made significant strides with respect to the design of the APEX performance measurement infrastructure, performance measurement of the HPX-3 runtime system, and beginning the transition from first-person performance reporting to third-person performance observation. The XPRESS team has implemented an APEX prototype using TAU [27] as the core measurement infrastructure. The APEX instrumentation interface provides access to TAU performance timers and counters. The APEX prototype also works with TAU event-based sampling, providing performance observation of the HPX-3 runtime and application code without instrumentation, which will guide the placement of additional APEX timers in the HPX-3 runtime. Figure 3 shows an architecture diagram of how APEX integrates with the layers of the OpenX stack.

Figure 3: The APEX prototype architecture. (The figure shows APEX monitoring and writing to the RCR blackboard, which holds hardware, OS, HPX, XPI, and DSL or legacy state; APEX broadcasts performance state, monitors performance state, and triggers actions, alongside the RCR daemon and the profiles and traces produced through TAU.)

Because we are only in the beginning stages of the project, significant functionality is missing from our current implementation. As HPX has not yet been ported to Kitten/LXK, our implementation has so far been tested only on existing x86_64 Linux clusters. Our implementation also does not yet provide a runtime control mechanism, nor true third-person observation and response. Those aspects of the design will be developed over the remainder of the project.

The APEX prototype has been developed with the HPX-3 build process in mind and is designed to integrate seamlessly into the build. If the APEX variables are not defined at the configuration step, HPX-3 is configured and compiled as normal. If the APEX variables are defined with a reference to the local APEX installation, the HPX-3 build system will enable APEX support.

HPX-3 previously provided support for various performance counters that could be periodically sampled and/or reported at program termination. Some examples of these counters include the runtime thread queue length, various thread counts, and the thread idle rate. Because these counters were only output to the user terminal or a log, they were not easily machine readable, nor in a form that could leverage existing analysis tools. HPX-3 was modified so that when the counters were observed they were also recorded by the APEX prototype and subsequently stored in performance profiles at program termination. Supporting the HPX-3 counters required a modification to TAU to create a new type of counter, a context user event. A context user event is annotated not only with the value it is measuring, but also with the current execution context of the application, i.e., the current function or subroutine. The measurement library also generates simple statistics (maximum, minimum, mean, variance). The context of the counters separates those collected during the thread scheduler part of the code from those collected in the application code.

Figure 4: HPX-3 profile in TAU ParaProf. [The profile shows the HPX "main" thread, HPX user-level threads, HPX "helper" OS threads, and the HPX thread scheduler loop, for GTCX on one node of the ACISS cluster at UO with 20/80 OS threads and 12 cores (user).]
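The context user event described above pairs each sampled value with its execution context and keeps running statistics per (event, context) pair. A minimal sketch of those semantics, with names of our own invention rather than TAU's actual API:

```python
# Illustrative sketch of a "context user event": samples carry both the
# measured value and the execution context (e.g. the enclosing function),
# and running statistics are kept separately per context.

class ContextUserEvent:
    def __init__(self, name):
        self.name = name
        # context -> [count, total, total of squares, min, max]
        self.stats = {}

    def sample(self, value, context):
        s = self.stats.setdefault(context, [0, 0.0, 0.0, value, value])
        s[0] += 1
        s[1] += value
        s[2] += value * value
        s[3] = min(s[3], value)
        s[4] = max(s[4], value)

    def mean(self, context):
        n, total = self.stats[context][0], self.stats[context][1]
        return total / n

    def variance(self, context):
        n, total, total_sq = self.stats[context][:3]
        m = total / n
        return total_sq / n - m * m

# Samples of the same counter are kept apart by context, so scheduler-time
# observations do not pollute application-time statistics.
queue_len = ContextUserEvent("thread-queue-length")
queue_len.sample(4, context="hpx-thread-scheduler-loop")
queue_len.sample(8, context="hpx-thread-scheduler-loop")
queue_len.sample(2, context="application")
```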

The HPX-3 runtime thread manager was also instrumented with APEX timers. This instrumentation will provide insight into how much time is spent in runtime thread scheduling as opposed to actual application processing. An example of a profile collected with these timers is shown in Figure 4. This example was run with 48 total HPX threads on four nodes of ACISS [31], the computational cluster at the University of Oregon. In addition to the twelve operating system threads per node that are directly involved in application execution progress, HPX-3 has eight additional operating system threads. Instrumenting HPX-3 in this way will help HPX developers reduce the runtime thread scheduling overhead. Figure 5 shows a trace timeline of the thread scheduler in HPX-3 when executing an HPX-3 application.

Prior to the XPRESS project, HPX-3 had been instrumented with Intel Instrumentation and Tracing Technology (Intel® ITT) [12]. ITT provided detailed performance measurement of the HPX-3 runtime, not unlike the APEX implementation. However, ITT does not provide multi-node (multi-locality) support. In addition, there is no explicit annotation of whether an HPX thread completed its execution or was preempted and migrated to another thread (possibly even to another locality). Finally, there is no explicit annotation of the provenance of the HPX thread, i.e., the HPX thread which spawned it. By capturing this information, APEX can support more performance scenarios than ITT. To that end, the ITT instrumentation points have been extended to also provide detailed APEX instrumentation. Because APEX does provide multi-node support, multi-node performance experiments are possible with APEX.
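The provenance annotation described above amounts to recording, for every spawned task, the identifier of the task that created it, so a completed or migrated HPX thread can be traced back through its ancestry. A toy sketch (all names ours, not the HPX API):

```python
import itertools

# Illustrative sketch of task-provenance capture: each task stores the id
# of its parent at spawn time, and a lineage query walks the chain back
# to the root task.

_ids = itertools.count()

class Task:
    def __init__(self, name, parent=None):
        self.id = next(_ids)
        self.name = name
        self.parent = parent.id if parent is not None else None

def lineage(task, tasks_by_id):
    """Return the chain of task names from `task` back to the root."""
    chain = [task.name]
    while task.parent is not None:
        task = tasks_by_id[task.parent]
        chain.append(task.name)
    return chain

# Spawn a small tree of tasks and trace the leaf back to its origin.
main = Task("hpx_main")
child = Task("partition", parent=main)
grandchild = Task("scatter", parent=child)
index = {t.id: t for t in (main, child, grandchild)}
```

With this information attached to measurements, a profile can attribute the cost of a task to the logical call chain that spawned it rather than only to the OS thread it happened to run on.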

One of the key aspects of the APEX third-person observation design is to incorporate the full system context in any performance observation. With that in mind, the XPRESS team has also been working to integrate the RCRToolkit library into the APEX measurement infrastructure. As described in § 3.2.1, RCRToolkit exposes holistic hardware and system observations that are not available in the currently executing process space through a shared memory region, called the RCRBlackboard. The APEX prototype has integrated with the RCRBlackboard as a client application, so that non-core hardware measurements such as power and energy can be taken. Because the current implementation of RCRToolkit has specific hardware and execution permission requirements, the XPRESS team has established common development systems at both RENCI and LSU that meet those requirements. On the RENCI development system with an Intel Sandy Bridge architecture, APEX has successfully collected non-core hardware measurements such as power and energy from the RCRBlackboard, which is provided by the RCRDaemon running on that system.

Figure 5: HPX-3 trace in TAU Jumpshot. [The timeline shows the HPX main process thread and the HPX "helper" OS threads; orange regions are the HPX thread scheduler loop, cyan regions are HPX user-level work.]
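The blackboard pattern itself is simple: a daemon publishes measurements into a shared memory region, and any client maps the region and polls it. A minimal single-file sketch of the idea, with a layout (one little-endian double) and file name that are ours, not RCRToolkit's:

```python
import mmap
import os
import struct
import tempfile

# Illustrative sketch of the RCRBlackboard idea: a "daemon" writes a
# non-core measurement (say, a package energy counter in joules) into a
# shared mapping, and a "client" such as APEX maps the same region
# read-only and reads the current value.

REGION_SIZE = 8  # one little-endian double

path = os.path.join(tempfile.mkdtemp(), "rcr_blackboard")
with open(path, "wb") as f:
    f.write(b"\x00" * REGION_SIZE)

# Daemon side: update the counter in place.
with open(path, "r+b") as f:
    board = mmap.mmap(f.fileno(), REGION_SIZE)
    board[:REGION_SIZE] = struct.pack("<d", 42.5)
    board.flush()

# Client side: map the region read-only and poll the counter.
with open(path, "rb") as f:
    view = mmap.mmap(f.fileno(), REGION_SIZE, access=mmap.ACCESS_READ)
    energy = struct.unpack("<d", view[:REGION_SIZE])[0]
```

The real RCRBlackboard holds many counters behind a versioned layout and is updated asynchronously by the RCRDaemon; the point here is only that clients read measurements they could not take from their own process space.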

5. EXPERIMENTAL RESULTS

GTC [19] is a 3D particle-in-cell (PIC) code developed for studying turbulent transport in magnetic confinement fusion plasmas [20, 7]. In the PIC method, the interaction between particles is calculated using a grid on which the charge of each particle is deposited and then used in the Poisson equation to evaluate the field. This is the scatter phase of the PIC algorithm. Next, the force on each particle is gathered from the grid-based field and evaluated at the particle location for use in the time advance. The PIC algorithm is the most expensive part of GTC.
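The scatter phase can be illustrated with a toy 1D charge deposition: each particle's charge lands on its two nearest grid points with linear weights. This is purely illustrative; GTC's real deposition is gyrokinetic and three-dimensional.

```python
# Toy 1D sketch of the PIC scatter phase: linear-weight charge deposition
# onto a node-centred grid. Not GTC's actual algorithm.

def deposit_charge(positions, charge, n_cells, length):
    grid = [0.0] * (n_cells + 1)   # node-centred grid points
    dx = length / n_cells
    for x in positions:
        i = int(x / dx)            # index of the grid point to the left
        w = (x / dx) - i           # fractional offset within the cell
        grid[i] += charge * (1.0 - w)
        grid[i + 1] += charge * w
    return grid

# Two unit-charge particles on a 4-cell domain of length 4.
grid = deposit_charge([0.5, 2.25], charge=1.0, n_cells=4, length=4.0)

# Linear weighting conserves charge: the grid total equals the particle total.
assert abs(sum(grid) - 2.0) < 1e-12
```

The gather phase is the mirror image: the field is interpolated from the same two grid points back to each particle position with the same weights, which is what makes the scheme charge-conserving.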

Advances in multi-core technology have presented some challenges for GTC in terms of strong scaling. Strategies combining OpenMP and MPI in GTC have not resulted in a clear solution to the scalability issue in the context of multi-core; this is a topic of active research. GTC has already been ported to C++ and HPX, and its algorithms are being modified to remove the global barriers which inhibit scalability.

To test the APEX prototype, HPX-3 was configured with APEX support, and the GTC sample application was built and executed on the ACISS cluster. The experiment was performed on a single node with 32 cores and 384 GB of memory. Figure 6 shows the variation in the amount of time spent in the thread scheduler loop. The overhead of the scheduler varies from 15% to 18.8% of the total execution time.

Because the overhead appears considerable for this application, more analysis is required. A trace of the same execution was also collected. Figure 7 shows a zoomed-in region of the trace timeline, showing the 32 HPX worker threads and 8 additional operating system threads. There appear to be synchronization artifacts in the application, causing the significant overhead in the scheduler. The synchronization occurs when work cannot be executed until data dependencies are resolved. In the trace figure, the cyan colored bars represent time spent in the gtcx_partition_loop_action task, while the orange bars in between represent the time spent in hpx-thread-scheduler-loop. Further analysis is required, but it appears that this application's performance can be improved if the "lockstep" behavior can be removed from the algorithm, or if its effect can be mitigated.

Figure 6: Performance profile of the HPX-3 thread scheduler loop while executing GTC. 32 HPX threads were requested, resulting in 40 operating system threads in total.

6. DISCUSSION AND FUTURE WORK

The initial APEX prototype will be useful in providing integrated performance measurement of HPX-3 applications running on current operating systems and platforms. However, there are a number of immediate issues and items of future work to discuss. The integration of HPX-3 counters into APEX, as discussed in § 4, will need to be modified so that the counters are not recorded in APEX only when periodically polled, but rather are updated directly when their values change. In addition, as discussed in § 4, APEX provides detailed instrumentation of the thread scheduler in HPX-3. However, like ITT, as yet there is no explicit annotation of whether an HPX thread completed its execution or was preempted and migrated to another thread, and there is no explicit annotation of the provenance of the HPX thread. Because TAU does not have the capability to capture that information explicitly, APEX will be designed to capture these important aspects of performance.

While the prototype implementation of APEX provides insight into HPX applications, there are some drawbacks to our prototype approach. The most significant of these is that the performance data is not yet modeled in the ParalleX view of the world. For example, the performance model currently used in APEX uses measurement dimensions that include only the currently executing process and operating system thread, with no explicit support for capturing the HPX-3 runtime thread. However, until the ParalleX performance model is fully captured in APEX, we still gain valuable insight into the application by implementing our prototype in existing tools, running on existing systems, and mapping back to the ParalleX model. In addition, more components of the HPX-3 runtime can and should be instrumented, including the parcel transport layer, local control objects, and the active global address space.

Figure 7: Performance trace detail of the HPX-3 operating system threads while executing GTC. The 32 runtime system threads show a synchronous pattern, limiting scalability and adding overhead.

Finally, there is much work to be done with respect to information sharing between components, runtime analysis and distillation of wide-scale performance data, integration with the operating system, measuring legacy and XPI-level application context, and an event-based model for triggering runtime performance corrections. Over the lifetime of the XPRESS project, APEX will grow to provide these features.
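The event-based trigger model mentioned above can be sketched as a guard over a performance signal that fires an action when a threshold is crossed, for example asking the runtime to throttle concurrency when the thread idle rate climbs. All names here are hypothetical; APEX's eventual trigger API is future work.

```python
# Illustrative sketch of an event-based performance trigger: crossing a
# threshold fires a registered action once, then disarms the trigger so
# the correction is not re-applied on every subsequent sample.

class Trigger:
    def __init__(self, threshold, action):
        self.threshold = threshold
        self.action = action
        self.armed = True

    def observe(self, value):
        if self.armed and value > self.threshold:
            self.armed = False
            self.action(value)

# Usage: fire a (hypothetical) "reduce-threads" correction the first time
# the observed idle rate exceeds 15%.
events = []
idle_guard = Trigger(threshold=0.15,
                     action=lambda v: events.append(("reduce-threads", v)))
for idle_rate in (0.05, 0.09, 0.22, 0.30):
    idle_guard.observe(idle_rate)
```

A production version would re-arm with hysteresis (e.g. once the signal drops back below a lower threshold) so corrections can repeat without oscillating.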

7. ACKNOWLEDGEMENTS

Support for this work was provided through the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research (and Basic Energy Sciences/Biological and Environmental Research/High Energy Physics/Fusion Energy Sciences/Nuclear Physics) under award numbers DE-SC0008638, DE-SC0008704, DE-FG02-11ER26050 and DE-SC0006925. The ACISS cluster is supported by a Major Research Instrumentation grant from the National Science Foundation (NSF), Office of Cyber Infrastructure, grant number OCI-0960354. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. DOE's National Nuclear Security Administration under contract DE-AC04-94AL85000.

8. REFERENCES

[1] Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., and Tallent, N. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685–701. http://hpctoolkit.org/.

[2] Anderson, M., Brodowicz, M., Kaiser, H., and Sterling, T. L. An Application Driven Analysis of the ParalleX Execution Model. CoRR abs/1109.5201 (2011). http://arxiv.org/abs/1109.5201.

[3] Baker, C., Davidson, G., Evans, T. M., Hamilton, S., Jarrell, J., and Joubert, W. High performance radiation transport simulations: preparing for Titan. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Los Alamitos, CA, USA, 2012), SC '12, IEEE Computer Society Press, pp. 47:1–47:10.

[4] Boost: a collection of free peer-reviewed portable C++ source libraries, 2011. http://www.boost.org/.

[5] Brightwell, R., and Pedretti, K. An intra-node implementation of OpenSHMEM using virtual address space mapping. In Proceedings of the Fifth Partitioned Global Address Space Conference (October 2011).

[6] Dongarra, J., London, K., Moore, S., Mucci, P., and Terpstra, D. Using PAPI for hardware performance monitoring on Linux systems. In International Conference on Linux Clusters: The HPC Revolution (June 2001).

[7] Ethier, S., Tang, W. M., and Lin, Z. Gyrokinetic particle-in-cell simulations of plasma microturbulence on advanced computing platforms. J. Phys.: Conf. Ser. 16 (2005).

[8] Fowler, R., Cox, A., Elnikety, S., and Zwaenepoel, W. Using Performance Reflection in Systems Software. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HOTOS IX) (Lihue, HI, Mar. 2003). Extended abstract.

[9] Gamblin, T., de Supinski, B., Schulz, M., Fowler, R., and Reed, D. Efficiently clustering performance data at massive scales. In Proceedings of the International Conference on Supercomputing 2010 (ICS2010) (Tsukuba, Japan, June 2010), ACM.

[10] Gamblin, T., de Supinski, B. R., Schulz, M., Fowler, R., and Reed, D. A. Scalable load-balance measurement for SPMD codes. In Proceedings of Supercomputing 2008 (Austin, TX, Nov. 2008), ACM/IEEE.

[11] Heroux, M., Bartlett, R., Hoekstra, V. H. R., Hu, J., Kolda, T., Lehoucq, R., Long, K., Pawlowski, R., Phipps, E., Salinger, A., Thornquist, H., Tuminaro, R., Willenbring, J., and Williams, A. An Overview of Trilinos. Tech. Rep. SAND2003-2927, Sandia National Laboratories, 2003.

[12] Intel. Intel® ITT API open source version. http://software.intel.com/en-us/articles/intel-itt-api-open-source, 2013.

[13] Intel Corporation. Intel(R) Xeon(R) Processor 7500 Series Uncore Programming Guide, March 2010.

[14] Intel Corporation. Intel MIC. http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html, 2013.

[15] John Levon et al. OProfile. http://oprofile.sourceforge.net/, 14 April 2006.

[16] Kaiser, H., Adelstein-Lelbach, B., et al. HPX SVN repository, 2011. Available under a BSD-style open source license. Contact [email protected] for repository access.

[17] Kaiser, H., Brodowicz, M., and Sterling, T. ParalleX: An advanced parallel execution model for scaling-impaired applications. In Parallel Processing Workshops (Los Alamitos, CA, USA, 2009), IEEE Computer Society, pp. 394–401.

[18] Kumar, R., Tullsen, D. M., Ranganathan, P., Jouppi, N. P., and Farkas, K. I. Single-ISA heterogeneous multi-core architectures for multithreaded workload performance. In International Symposium on Computer Architecture (2004), p. 64.

[19] Lin, Z., Ethier, S., and Lewandowski, J. GTC: 3D Gyrokinetic Toroidal Code, 2012.

[20] Lin, Z., Hahm, T. S., Lee, W. W., Tang, W. M., and White, R. B. Turbulent transport reduction by zonal flows: Massively parallel simulations. Science 281, 5384 (1998), 1835–1837.

[21] Nvidia Corporation. The benefits of quad core CPUs in mobile devices. http://www.nvidia.com/content/PDF/tegra_white_papers/tegra-whitepaper-0911a.pdf.

[22] Olivier, S., Porterfield, A., Wheeler, K., and Prins, J. Scheduling task parallelism on multi-socket multicore systems. In International Workshop on Runtime and Operating Systems for Supercomputers (Tucson, AZ, USA, June 2011).

[23] Open|SpeedShop. http://www.openspeedshop.org/wp/.

[24] Porterfield, A., Fowler, R., and Lim, M. Y. RCRTool design document; version 0.1. Tech. Rep. TR-10-01, RENCI, 2010.

[25] Sandia National Laboratories. The Kitten Lightweight Kernel. https://software.sandia.gov/trac/kitten.

[26] Shende, S., and Malony, A. The TAU Parallel Performance System. International Journal of High Performance Computing Applications 20, 2 (Summer 2006), 287–311. ACTS Collection Special Issue.

[27] Shende, S., and Malony, A. D. The TAU Parallel Performance System. International Journal of High Performance Computing Applications 20, 2 (Summer 2006), 287–311.

[28] STE||AR Group. Systems Technologies, Emerging Parallelism, and Algorithms Research, 2011. http://stellar.cct.lsu.edu.

[29] The C++ Standards Committee. ISO/IEC 14882:2011, Standard for Programming Language C++. Tech. rep., ISO/IEC, 2011. http://www.open-std.org/jtc1/sc22/wg21.

[30] University Corporation for Atmospheric Research. Community Earth System Model (CESM). http://www.cesm.ucar.edu, 2013.

[31] University of Oregon. ACISS. http://aciss.uoregon.edu, 2013.