

Rethinking High Performance Computing Platforms: Challenges, Opportunities and Recommendations

Ole Weidner
School of Informatics
University of Edinburgh, UK
[email protected]

Malcolm Atkinson
School of Informatics
University of Edinburgh, UK
[email protected]

Adam Barker
School of Computer Science
University of St Andrews, UK
[email protected]

Rosa Filgueira
School of Informatics
University of Edinburgh, UK
[email protected]

ABSTRACT
A new class of "second generation" high-performance computing applications with heterogeneous, dynamic and data-intensive properties has an extended set of requirements, which cover application deployment, resource allocation and control, and I/O scheduling. These requirements are not met by the current production HPC platform models and policies. This results in a loss of opportunity, productivity and innovation for new computational methods and tools. It also decreases effective system utilization for platform providers due to unsupervised workarounds and "rogue" resource management strategies implemented in application space. In this paper we critically discuss the dominant HPC platform model and describe the challenges it creates for second generation applications because of its asymmetric resource view, interfaces and software deployment policies. We present an extended, more symmetric and application-centric platform model that adds decentralized deployment, introspection, bidirectional control and information flow, and more comprehensive resource scheduling. We describe cHPC: an early prototype of a non-disruptive implementation based on Linux Containers (LXC). It can operate alongside existing batch queuing systems and exposes a symmetric platform API without interfering with existing applications and usage modes. We see our approach as a viable, incremental next step in HPC platform evolution that benefits applications and platform providers alike. To demonstrate this further, we lay out a roadmap for future research and experimental evaluation.

CCS Concepts
• Social and professional topics → Centralization / decentralization; Software selection and adaptation;
• Computer systems organization → Reliability;

Keywords
HPC platform models; HPC platform APIs; usability; resource management; OS-level virtualization; Linux containers

1. INTRODUCTION

With computational methods, tools and workflows becoming ubiquitous in more and more scientific domains and disciplines, the software applications and user communities on high performance computing platforms are rapidly growing more diverse. Many of the emerging second generation HPC applications move beyond tightly-coupled, compute-centric methods and algorithms and embrace more heterogeneous, multi-component workflows, dynamic and ad-hoc computation, and data-centric methodologies. While diverging from the traditional HPC application profile, many of these applications still rely on the large number of tightly coupled cores, cutting-edge hardware and advanced interconnect topologies provided only by HPC clusters. Consequently, HPC platform providers often find themselves faced with requirements and requests that are so diverse and dynamic that they become increasingly difficult to fulfill efficiently within the current operational policies and platform models. The balancing act between supporting stable production science on the one hand and novel application types and exploratory research in high-performance, distributed, and scientific computing on the other puts an additional strain on platform providers. In many places this inevitably creates dissatisfaction and friction between the platform operators and their user communities. However, this largely remains an unquantified, subjective perception throughout the user and platform provider communities. We believe that platform providers and users have a common mission to push the edge of the envelope of scientific discovery. Friction and dissatisfaction create an unnecessary loss of momentum and in the worst cases can cause productivity and innovation to stall.

Cloud computing offers an alternative paradigm to traditional HPC platforms. Clouds offer on-demand utility computing as a service, with resource elasticity and pay-as-you-go pricing. One important aspect to address is whether our vision of a symmetric HPC platform isn't just trying to turn HPC platforms into Cloud-like environments. This depends on the class and structure of the application. Loosely coupled, elastic applications, which don't require guaranteed performance, make good candidates for virtualized environments. However, there are several classes of application which typically run more effectively on dedicated HPC platforms: (1) Performance-sensitive applications: virtualization imposes a significant performance overhead and, due to

arXiv:1702.05513v2 [cs.DC] 27 Feb 2017


multi-tenancy, application performance cannot usually be guaranteed; (2) Interconnect-sensitive applications, which require co-location, low latency and high throughput; (3) I/O-sensitive applications that, without a very fast I/O subsystem, will run slowly because of storage bottlenecks; (4) Applications which require dedicated and specialized hardware to run the computation; (5) Finally (an aspect which is often overlooked), the cost of migrating and storing data in the cloud is high. Data-intensive or big data applications often have data which is tethered to a site, and therefore the only viable option is to run the application on dedicated institutional resources.

Our primary objective is to deliver HPC platforms that provide more flexible mechanisms and interfaces for applications that are inherently dependent on their architectural advantages. Otherwise, we fear that evolution and innovation in second generation applications might come to a grinding halt as the platform is too confining for advanced use-cases. We argue that these confining issues are largely caused by a structural asymmetry between platforms and applications. This asymmetry can be observed in the operational policies and, as a consequence, in production HPC platform models and their technical implementations. Operational policies are characterized by a centralized software deployment process managed and executed by the platform operators. This impedes application development and deployment by creating a central bottleneck, especially for second generation applications that are built on non-standard or even experimental software. Production HPC platform models are characterized by static resource mapping, an asymmetric resource model, and limited information exchange and control channels between platforms and their tenants. In this paper we propose changes to policies and platform models to improve the application context without jeopardizing platform stability and reliability:

1. By moving away from deployment monopolies, software provisioning can be handled directly by the users and application experts, reducing bottlenecks, supporting increased application mobility, and creating a shared sense of responsibility and a better balance between the two stakeholders. This allows platform operators to focus on running HPC platforms at optimal performance, reliability, and utilization.

2. A more symmetric resource model, information and control flow between platform and application will significantly improve platform operation while supporting the development and adoption of innovative applications, higher-level frameworks and supporting libraries. We show via practical examples and existing research how a platform model that is built upon more symmetric information and control flow can provide a solid supporting foundation for this. The more applications exploit this foundation, the easier it becomes to operate an HPC platform at the desired optimal point of performance, reliability, and utilization. We suggest how to amend and extend existing platform models and their implementations.

This paper is structured as follows: In section 2 we present our experience and observations from working with three different classes of second generation applications on production HPC platforms: dynamic applications (section 2.2), data-intensive applications (section 2.1), and federated applications (section 2.3). In section 3 we describe the structural asymmetry in the operational policies (section 3.1) and platform model (section 3.2) along with the challenges they represent for the applications in our focus group. In section 4 we recommend a more balanced and symmetric HPC platform model, based on isolated, user-driven software environments (section 4.3), network and filesystem I/O as schedulable resources (section 4.2), and improved introspection and control flow (section 4.1) between platforms and applications. In section 5 we present cHPC, an early prototype implementation of our extended platform model based on operating system-level virtualization and Linux containers (section 5.1). We suggest how such a system can coexist with existing batch queueing systems (section 5.2). Finally, we list related work and lay out our upcoming research agenda (section 6). The main contributions of this paper are:

1. Identification of existing friction and issues in application development and between HPC platform providers and their tenants.

2. Recommendations for a more symmetric HPC platform model based on decentralized software deployment and symmetric interfaces between platforms and applications.

3. A non-disruptive candidate implementation of the conceptual platform model based on operating system-level virtualization and containers that can operate alongside existing HPC queueing systems.

4. A research agenda and statement to further explore novel HPC platform models based on operating system-level virtualization.

2. FOCUS APPLICATIONS

The authors have deep experience in architecting, developing and running a diverse portfolio of second generation high-performance and distributed computing applications, tools and frameworks. These include tightly-coupled parallel codes; distributed, data-intensive and dynamic applications; and higher-level application and resource management frameworks. This experience shaped the position we are taking. This section characterizes the applications and the challenges they are facing on today's HPC platforms.

It would be false to claim that current production HPC platforms fail to meet the requirements of their application communities. It would be equally wrong to claim that the existing platform model is a pervasive problem that generally stalls the innovation and productivity of HPC applications. It is important to understand that significant classes of applications, often from the monolithic, tightly-coupled parallel realm, have few concerns regarding the issues outlined in this paper. These applications produce predictable, static workloads and typically map well to the hardware architectures and network topologies. They are developed and hand-optimized to utilize resources as efficiently as possible. They are the original tenants and drivers of HPC and have an effective social and technical symbiosis with their platform environments.

However, it is equally important to understand that other classes of applications (that we call second generation applications) and their respective user communities share a less rosy perspective. These second generation applications are typically non-monolithic, dynamic in terms of their runtime behavior and resource requirements, or based on higher-level tools and frameworks that manage compute, data and communication. Some of them actively explore new compute and data handling paradigms, and operate in a larger, federated context that spans multiple, distributed HPC clusters. When evaluating the challenges, opportunities and recommendations that are laid out in this paper, the reader should keep in mind the following three classes of applications.

2.1 Data-Intensive Applications

Data-intensive applications require large volumes of data and devote a large fraction of their execution time to I/O and manipulation of data. Careful attention to data handling is necessary to achieve acceptable performance or completion. They are frequently sensitive to local storage for intermediate results and reference data. They are also sensitive to the data-intensive frameworks and workflow systems available on the platform and to the proximity of the data they use. This may be a result of complex input data, e.g. from many sources, with potentially difficult access patterns, or with requirements for demanding data update patterns, or simply large volumes of input, output or intermediate data, so that I/O times or data storage resources limit performance.

Examples of large-scale, data-intensive HPC applications are seismic noise cross-correlation and misfit calculation as encountered, e.g., in the VERCE project [1]. Such computations are the only way of observing the deep earth and are critical in hazard estimation and responder support. The forward wave simulations using SPECFEM3D [3] impose very demanding loads on today's HPC clusters with fast cores and high-bandwidth interconnects. Critical geophysics phenomena are 3 orders of magnitude smaller than current simulations can resolve; hence another factor of 10^9 in computational power could be used. Inverting the seismic signals to build earth models of sub-surface phenomena requires iterations that run the forward model, compare the results with seismic observations via misfit analysis at each seismic station, and compute an adjoint to back-propagate to refine the model. The models are irregular finite element 3D meshes with 10^6 to 10^7 cells. The noise correlations are modeled as complex workflows ingesting multivariate time series from more than 1000 seismic stations. These data are prepared by a pipeline of pre-processing, analysis, cross-correlation and post-processing phases.

The main issues we encountered with these applications were the difficulty of establishing an environment that met all the prerequisites, the difficulty of establishing suitable data proximity, the difficulty of efficiently handling intermediate data, the difficulty of enabling users to inspect progress, and the difficulty of arranging the appropriate balance of the properties of the hardware context. Coupling concurrent parts of an application running on different platforms (to which they were well suited), porting the applications to new platforms, and avoiding moving large volumes of data often proved impossible.

2.2 Dynamic Applications

Dynamic applications fall into two broad categories: (i) applications for which we do not have full understanding of the runtime behavior and resource requirements prior to execution, and (ii) applications which can change their runtime behavior and resource requirements during execution. Dynamic applications are driven by adaptive algorithms that can require different resources depending on their input data and parameters. E.g., a data set can contain a specific area

of interest which triggers in-depth analysis algorithms, or a simulation can yield an artifact or boundary condition that triggers an increase in the algorithmic resolution. Examples of dynamic HPC applications are: (a) applications that use ensemble Kalman filters for data assimilation in forecasting (e.g. [5]), (b) simulations that use adaptive mesh refinement (AMR) to refine the accuracy of their solutions (e.g. [2]), and (c) seismic wave propagation simulations that modify their code by using compilation in their early phases (e.g. SPECFEM3D [3]). Many other examples exist.

The main issues we encountered running dynamic applications on production HPC platforms originate in their dynamic resource and time requirements. A Kalman filter application might run for two hours or for four hours, depending on the model's initial conditions. Similarly, an AMR simulation of a molecular cloud might require an additional 128 CPU cores during its computation to increase the resolution in an area of interest. Resource managers on production HPC platforms do not support such requirements: the maximum runtime is restricted by the walltime limit set at startup. Resource requirements, e.g. CPU cores and memory, are similarly set at startup. It is neither possible to request an extension of the runtime nor to inform a resource manager about reducing or increasing requirements.

This inflexibility of the platforms has led to interesting, yet obscure application architectures: in [5] for example, the application is forced to opportunistically allocate additional resources via an SSH connection to the job manager on the cluster head node and release them if they are not required. This technique (which can be found in many other applications) wastes platform resources and increases the complexity of the applications significantly by adding complex resource management logic, which detracts from their stability and reliability. Generally, application developers are very creative when it comes to circumventing platform restrictions. Overlay resource management systems such as pilot jobs are becoming increasingly popular exactly because they enable applications to achieve this effect. We see problems with this approach for applications and platforms. It adds active resource management as a burden on the shoulders of developers and users. It dilutes the focus of the applications and adds more complexity and additional dependencies on potentially short-lived software tools. It increases the expertise required to develop dynamic applications and consequently restricts widespread adoption of adaptive techniques. From the platform's perspective, circumventive methods invariably lead to poorer platform utilization. On the other hand, dynamic applications without active resource management lead to hollow utilization as applications terminate without producing useful results or without a recent checkpoint.
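The opportunistic allocation workaround described above can be sketched as follows. This is an illustrative reconstruction, not code from [5]; the head node name, queue parameters and script path are hypothetical:

```python
import subprocess  # a real application would hand the command to subprocess.run()

def request_extra_nodes(head_node: str, cores: int, walltime: str) -> list:
    """Build the SSH command an application would run from inside its
    allocation to submit an additional job via the head node's job
    manager, since compute nodes cannot talk to the scheduler directly."""
    qsub = f"qsub -l nodes={cores // 16}:ppn=16,walltime={walltime} worker.sh"
    return ["ssh", head_node, qsub]

# The application decides mid-run that it needs 128 more cores.
cmd = request_extra_nodes("cluster-headnode", cores=128, walltime="02:00:00")
print(" ".join(cmd))
```

Everything behind the `ssh` hop is invisible to the resource manager's normal accounting and scheduling logic, which is exactly why such "rogue" strategies erode overall utilization.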

2.3 Federated Applications

Federated HPC environments have become more and more prominent in recent years. Based on the idea that federation fosters collaboration and allows scalability beyond a single platform, policies and funding schemes explicitly supporting the development of concepts and technology for HPC federations have been put into place. Larger federations of HPC platforms are XSEDE in the US and PRACE in the EU. Both provide access to several TOP500-ranked HPC clusters and an array of smaller and experimental platforms. With policies and federated


user management and accounting in place, application developers and computer science researchers are encouraged to develop new application models that can harness the federated resources in new and innovative ways. Examples of resource federation systems are CometCloud [4] and RADICAL-Pilot [6]. RADICAL-Pilot is a pilot-job system that allows transparent job scheduling to multiple HPC platforms via the SSH protocol. CometCloud is an autonomic computing engine for Cloud and HPC environments. It provides a shared coordination space via an overlay network and various types of programming paradigms such as Master/Worker, Workflow, and MapReduce/Hadoop. Both systems have been used to build a number of different federated computational science and engineering applications, including distributed replica-exchange molecular dynamics, ensemble-based molecular dynamics workflows and medical image analysis. The federation platform provides the execution primitives and patterns for the applications and marshals the job execution as well as the data transfer of input, output and intermediate data from, to and between the different HPC platforms.

Deployment and application mobility have been the biggest issues with federated applications and overlay platform prototypes. Even if federated platform use could be shown to work conceptually, in practice the applications were highly sensitive to, and would often fail because of, the software environment on the individual platforms. Software environments are not synchronized or federated across platforms. As a result, different versions of domain software tools caused applications to fail. Automatically compiling and installing applications in user space was difficult and sometimes impossible due to incompatible compilers, runtimes and libraries on the target platform. Maintaining a large database of tools, versions, paths and command line scripts for the individual platforms consumed a significant fraction of the overall development effort. Another issue was the limited resource allocation, monitoring and control mechanisms provided by the platforms, which would be crucial for an overarching execution platform to make informed decisions. Analogous to the discussion in section 2.2, applications are bound to static resource allocation. A common pattern we observed was that federated applications would schedule resources on different platforms and just use the one that became available first. We further observed that application users were largely unsuccessful in allocating larger numbers of resources on multiple platforms concurrently due to missing resource allocation control. This makes the idea of having very large applications spanning more than one platform very difficult to achieve with production HPC platforms.
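The "submit everywhere, use whichever starts first" pattern can be sketched as below. The platform names and the in-memory queue-wait simulation are purely illustrative stand-ins for real job submissions:

```python
def first_available(platforms: dict) -> tuple:
    """Submit the same job to every platform, take the first one whose
    job becomes active, and cancel the redundant submissions.
    `platforms` maps a platform name to its simulated queue wait (ticks)."""
    pending = dict(platforms)
    tick = 0
    while True:
        tick += 1
        started = [name for name, wait in pending.items() if wait <= tick]
        if started:
            winner = min(started, key=lambda name: pending[name])
            cancelled = [name for name in pending if name != winner]
            return winner, cancelled

# Simulated queue waits: the first platform's job starts soonest.
winner, cancelled = first_available({"archer": 2, "stampede": 5, "supermuc": 9})
print(winner, cancelled)
```

Note that the cancelled requests still occupied scheduler queue slots (and priority calculations) on two platforms the whole time — the utilization cost that the platforms, not the application, bear.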

3. CHALLENGES AND OPPORTUNITIES

In this section we describe the details of operational policies and properties of the dominant HPC platform model that we have identified as structurally hindering in our everyday work with second generation HPC applications. The platform model describes the underlying model and abstractions of the software system that interfaces an HPC cluster with its users. It defines the views and the interfaces users and applications have of the system. Important aspects of the platform model are (1) the interfaces provided to execute the application on the platform, (2) the view of the platform's hardware resources while the application is running, (3) the view of the application while it is running on the

HPC platform, and (4) the interfaces provided to control the application while running on the platform.

The platform model is determined by the software system that is used to manage the platform. As the dominant platform model, we identify HPC platforms that are (1) managed by a job manager / queueing system, (2) provide a shared file system across all platform nodes, (3) use the concept of jobs as the abstraction for executing applications, and (4) provide a single, global application execution context: the host operating system execution context. We consider it dominant because all production HPC platforms we have worked on and are aware of exhibit the same model. Only in the implementation details did we observe differences between platforms, e.g. in their choice of distributed file systems, queueing systems (PBS, SLURM, LoadLeveler, etc.) and host operating systems.

3.1 Existing Operational Policies

Software Provisioning

Software provisioning is a major issue that we have observed throughout all classes and types of HPC applications, not just the focus applications. Resource providers put a significant amount of effort into curating up-to-date catalogues of the software libraries, tools, compilers and runtimes that are relevant to their user communities. Software management tools like SoftEnv and Modules are commonly used to support this task. Because existing HPC platform models do not provide software environment isolation (unlike, for example, virtualized platforms), all changes made to a platform's software catalog have an impact on all users. Hence, the process is strongly guarded by the resource providers. Versioning and compatibility of individual software packages need to be considered with every update or addition to the catalog. As a consequence, getting an application and/or its dependencies installed on a platform requires direct interaction with, and in many cases persuasion of, the resource provider. While some software packages are considered less critical, others, like for example an alternative compiler version, Python interpreter or experimental MPI library, are considered disruptive and deployment is often refused. Software deployment in user space (i.e., the user's home directory) is an alternative, but in practice it has proven to be very tedious, error prone and difficult to automate. The software provisioning dilemma also affects application mobility (migratability). Because software environments cannot be shared between different platforms, application mobility comes at the cost of a significant deployment overhead that increases linearly with the number of HPC platforms targeted. In several application projects we have experienced software provisioning as a very hard, time and resource consuming process. Especially for projects that aim at federated usage of multiple HPC platforms, software deployment becomes a highly impeding factor.
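The per-platform bookkeeping this burden implies can be sketched as a simple compatibility check; the package names and version numbers below are made up for illustration:

```python
def check_environment(required: dict, installed: dict) -> dict:
    """Compare an application's required tool versions against what a
    platform actually provides. Returns a problem description per
    package: 'missing' or 'version mismatch (found X)'."""
    problems = {}
    for pkg, version in required.items():
        if pkg not in installed:
            problems[pkg] = "missing"
        elif installed[pkg] != version:
            problems[pkg] = f"version mismatch (found {installed[pkg]})"
    return problems

# One entry of the "large database of tools, versions and paths" that a
# federated application ends up maintaining for every target platform.
required = {"openmpi": "1.10.2", "hdf5": "1.8.16", "python": "3.5.1"}
platform_a = {"openmpi": "1.6.5", "python": "3.5.1"}
print(check_environment(required, platform_a))
```

Multiply such a table by every platform in a federation and every software update on any of them, and the maintenance cost described above follows directly.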

Networking

In- and outbound networking differs between platforms. Limitations are determined by the platform architecture and configuration, as well as the networking policies (firewalls) of the organization operating the platform. We have encountered everything, from platforms with no restrictions to platforms from which in- and outbound network connections were virtually impossible. These differences made it


extremely difficult not only to federate platforms, but also to migrate applications. We observed several cases in which applications simply were not able to run on a specific platform because they were designed around the assumption that communication and data transfer between an HPC platform and the internet is not confined. Affected were applications that dynamically load data from external servers or databases during execution or that rely on methods for monitoring and computational steering. In other cases the application's performance was severely crippled as it had to funnel all traffic through the head node instead of loading the data onto the compute nodes directly, because compute nodes could not "dial out". In addition, it is not possible to query platforms for their networking configuration programmatically. Platform documentation or support mailing lists are often the only way to gather this information.

3.2 Existing Platform Model

Static Resource Mapping

Existing platform models require users to define an application's expected total runtime, CPU and memory requirements prior to its execution. None of the job managers found on production HPC platforms deviates from this model or allows for an amendment of the expected total runtime during application execution. In our experience, enforcing static wall-time limits leads to two unfavorable scenarios. In the first scenario, applications run the risk of being terminated prematurely because their wall-time limit was set too optimistically. Especially dynamic applications are affected by this. For the users this means that they wasted valuable resource credits without producing any results. For the platform provider this means hollow utilization. In the second scenario, users "learn" from the first scenario and define the application wall-time limit very pessimistically. Most job managers weigh application scheduling priority against requested resources and wall-time. The higher the requested wall-time limit, the longer an application has to wait for its slot to run. This can significantly decrease user productivity. If the application finishes ahead of its requested wall-time limit, the platform's schedule is affected, which can result in suboptimal resource utilization. The same limitations hold true for CPU and memory resources, with similar implications for the applications. Especially for dynamic applications, static resource limits can become an obstacle to productivity and throughput (see also section 2.2).
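A minimal sketch of the static mapping in question: every quantity must be frozen in the job description before the scheduler will accept it. The `#PBS -l` directives are standard PBS resource requests; the node geometry, script body and values are illustrative:

```python
def make_job_script(cores: int, walltime: str, command: str) -> str:
    """Render a PBS job script. Core count, runtime ceiling and memory
    bound are all fixed here, at submission time; the model offers no
    way to renegotiate any of them while the job is running."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -l nodes={cores // 16}:ppn=16",  # static core count
        f"#PBS -l walltime={walltime}",         # static runtime ceiling
        "#PBS -l mem=64gb",                     # static memory bound
        command,
    ])

# A pessimistic walltime request: safe against premature termination,
# but it pushes the job further back in the scheduling queue.
print(make_job_script(256, "24:00:00", "mpirun ./simulation"))
```

The second scenario described above is visible right in this sketch: the 24-hour ceiling protects a run that might need 6 hours or 20, at the price of queue priority and, if the job finishes early, a hole in the platform's schedule.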

Asymmetric Resource Model

While static resource mapping can be a significant obstacle, it also ensures guaranteed resource availability and exclusive usage. Job managers require the user to define a fixed number of CPU cores and optionally the required memory per CPU core. The required network I/O bandwidth and filesystem I/O operations per second (IOPS), however, cannot be specified. When working with data-intensive applications and applications that periodically need to read or write large amounts of data in bursts, this can lead to unforeseeable variations in the overall performance and runtime of the application. Because network file system I/O bandwidth is shared with all other tenants on a platform, the I/O bandwidth available to a user's application critically depends on the I/O load generated by its peers. We have experienced significant fluctuations in application runtime

(and resulting failures) because of unpredictably decreasing I/O bandwidth (see also section 2.1). We have experienced similar issues with network I/O bandwidth. Just like file system bandwidth, the in- and outbound network connections are shared among all platform tenants. Bandwidth requirements cannot be defined. Depending on the network's utilization, the bandwidth available to an application can fluctuate significantly. For data-intensive applications that download input data and upload output data from and to sources outside the platform boundaries, this can become a non-negligible slowdown factor and a potential source of failure.

Missing Introspection, Control and Communication

Applications evolve over time through an iterative loop of analysis and optimization. Every time a new algorithm or execution strategy is added or a new platform is encountered, its implications on the performance and stability of the application need to be evaluated. For that, it is critical to understand the behavior of the application's processes, their interaction with the platform, and their resource utilization profile. Similarly, these insights are crucial for federated applications and frameworks (see also section 2.3) to make autonomous decisions about application scheduling and placement. HPC platforms currently provide very few tools that can help to gain these insights, and few are integrated with the platform. Interfaces to extract operational metrics of the resources and the application's jobs and processes are almost entirely missing on production HPC platforms.

The same limitations hold true for the communication channels between platforms and applications. The controls from application to platform are effectively limited to starting and stopping user jobs. The interface is usually confined to the job manager's command line tools and not designed for programmatic interaction. In the opposite direction, communication is confined to operating system signals emitted by the platform. Applications can then decide whether or not they want to implement signal handlers to react to the signals dispatched by the platform. Signals are little more than notifications about imminent termination.
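In practice, the only platform-to-application "message" an application can act on is a signal such as SIGTERM, which resource managers commonly dispatch shortly before forcibly killing a job at its walltime limit. A minimal sketch of this one-bit channel (the checkpoint logic is a placeholder):

```python
import signal

terminating = False

def on_sigterm(signum, frame):
    """The entire platform-to-application information flow:
    a single 'you are about to be killed' notification."""
    global terminating
    terminating = True
    # A real application would write a checkpoint here.

signal.signal(signal.SIGTERM, on_sigterm)

# Simulate the resource manager dispatching SIGTERM at the walltime limit.
signal.raise_signal(signal.SIGTERM)
print("checkpoint requested:", terminating)
```

Everything beyond this single bit — why the job is being terminated, how much grace time remains, whether an extension could be granted — must be guessed or obtained out of band, which is the asymmetry section 4 addresses.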

4. RECOMMENDATIONS

We have identified static resource mapping, an incomplete resource model, missing introspection and control, and a centralized, inflexible software deployment model as the main inhibitors to second generation HPC applications. In this section we lay out the blueprint for a more balanced platform model that incorporates introspection, bidirectional information and control flow, and decentralized deployment as first-order building blocks.

4.1 Introspection and Control Model

Many HPC applications and higher-level application frameworks do not implement common resilience and optimization strategies, even though the knowledge is available in research publications and prototype systems. We identify the need for interfaces to retrieve the physical and logical model and state of an application, as well as the physical model and state of the platform. This addresses the asymmetric platform issue: once an application is submitted to an HPC platform, users lose insight into and control over their application almost entirely. Existing point solutions that establish control and introspection for running applications are often difficult to adapt, or their deployment is infeasible due to their invasiveness. Furthermore, introspection implemented redundantly at application level adds pressure on a platform's resources and introduces additional sources of potential error. Based on these premises, we propose a platform model that incorporates and builds upon symmetric introspection, information and control flow across the platform-application barrier.

Physical and Logical Application Models

Our approach distinguishes between the logical and the physical model and state of an application. The logical model and state is rendered within the application logic and is designed by the application developer. Logical state consists, e.g., of the current state of an algorithm, the current state or progress of the simulation of a physical model, or whether an application is in a startup, shutdown or checkpointing phase. The logical model is inherently intrinsic to the application and can only be observed and interpreted fully within its confines. In contrast, the physical model and state of an application captures the executing entities that comprise a running application: OS processes and threads. Process state (also called context) informs about the locality, resource access and utilization of a process. This includes memory consumption, network and filesystem I/O. Together, the physical and the logical state make up the overall state of an application.

Intuitively, logical state determines the physical state of an application. However, changes in the physical state (e.g. failure) can influence the logical state as well. The third important aspect is the platform state, which also influences the physical state of an application (e.g. resource throttling, shutting down). This means that we end up with two entities that influence the physical state of the application with potentially contradictory goals: the state of the platform and the application model.

Physical Application Model and Interface

A physical application model should be able to capture the current state and properties of the executable entities of an application with respect to their relevance to the resilience and optimization strategies of the application and the platform. A physical state interface should allow real-time extraction of the physical state and properties. As the physical state of an application is the state of its comprising processes (and threads), the interface should provide at least the following information: (1) application processes and their locality (placement), (2) memory consumption, (3) CPU consumption, (4) filesystem I/O activity, (5) filesystem disk usage, (6) network activity (platform in-/outbound), and (7) inter-process network activity.
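As an illustration, the per-process information (1)-(7) could be carried by a simple data model such as the following Python sketch. All names and fields here are our own assumptions for illustration; they do not belong to any existing platform API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProcessState:
    """Physical state of one OS process, covering items (1)-(7)."""
    pid: int
    node: str                            # (1) locality / placement
    memory_bytes: int                    # (2) memory consumption
    cpu_percent: float                   # (3) CPU consumption
    fs_io_bytes_per_s: float             # (4) filesystem I/O activity
    disk_usage_bytes: int                # (5) filesystem disk usage
    net_external_bytes_per_s: float      # (6) platform in-/outbound traffic
    net_interprocess_bytes_per_s: float  # (7) inter-process traffic

@dataclass
class PhysicalApplicationModel:
    """Real-time physical model: the set of a job's processes."""
    app_id: str
    processes: List[ProcessState] = field(default_factory=list)

    def on_node(self, node: str) -> List[ProcessState]:
        """All processes placed on a given node/locality."""
        return [p for p in self.processes if p.node == node]

    def total_memory(self) -> int:
        """Aggregate memory footprint across all processes."""
        return sum(p.memory_bytes for p in self.processes)
```

A monitoring service would populate such a model from per-node measurements; consumers can then aggregate or filter it, e.g. `model.on_node("node-a")` to inspect placement.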

With this information available, a complete, real-time physical model of the application can be drawn (see Figure 1). The results of changes to the logical application state and the platform state can be observed within this model, hence it can serve as the foundation for manual and automated application analysis and optimization, and for bottleneck and error detection.

Real-time physical state information is not always trivial to handle and to interpret, especially not at large scales. While having the complete physical state information of an application available is crucial for advanced use-cases, it can turn out to be impractical or too complex to handle and

[Diagram: processes grouped by node/locality (i), each with hardware resources and filesystem access, annotated with metrics (ii)-(vii), and exposed through a real-time, full-model interface and a subscription/notify interface.]

Figure 1: Physical application model. Changes in logical application state and platform state can be observed within this model.

analyze for more basic use-cases and applications. Hence, we recommend that a physical state interface provides a subscription-based interface complementary to the real-time, full-model interface. A subscription-based interface allows boundary conditions to be set for entities in the physical model. If these boundary conditions are violated, a notification is sent to the application. For example, an application might set a boundary condition for maximum memory consumption or minimum filesystem throughput and act accordingly if any of these boundary conditions is violated, e.g. by adjusting its simulation models or algorithms.
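The subscription mechanism described above could look roughly as follows. This is a minimal sketch under assumed names (`Subscription`, `StateMonitor`); it is not the interface of any existing system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subscription:
    """A boundary condition on one metric of the physical model."""
    metric: str                            # e.g. "fs_mb_per_s"
    violated: Callable[[float], bool]      # boundary condition predicate
    notify: Callable[[str, float], None]   # callback into the application

class StateMonitor:
    """Checks incoming metric samples against boundary conditions."""
    def __init__(self) -> None:
        self._subs: Dict[str, List[Subscription]] = {}

    def subscribe(self, sub: Subscription) -> None:
        self._subs.setdefault(sub.metric, []).append(sub)

    def ingest(self, metric: str, value: float) -> None:
        # Called for every new sample from the physical state stream.
        for sub in self._subs.get(metric, []):
            if sub.violated(value):
                sub.notify(metric, value)

# Example: an application asks to be notified when filesystem
# throughput drops below a minimum of 100 MB/s.
alarms = []
monitor = StateMonitor()
monitor.subscribe(Subscription(
    metric="fs_mb_per_s",
    violated=lambda v: v < 100.0,
    notify=lambda m, v: alarms.append((m, v)),
))
monitor.ingest("fs_mb_per_s", 250.0)  # within bounds: no notification
monitor.ingest("fs_mb_per_s", 40.0)   # boundary violated: notification
```

On receiving such a notification, the application could switch to a less I/O-intensive algorithm or defer output, as suggested above.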

To allow a maximum degree of flexibility, the physical state interface of an HPC platform should not be confined to the application itself, but should also be accessible by higher-level application frameworks and services, and by human operators through interactive web portals and analysis and optimization tools.

Logical Application Model and Interface

While the physical state model of an application is explicitly determined by the state of its individual operating system processes, the logical state model of an application is much fuzzier, as it is largely defined by the application itself. The majority of logical application states will not be relevant to, or parseable by, entities outside the application. Hence, we recommend an implementation of an extensible logical state interface that captures application states that are (a) relevant outside the application logic, and (b) generic enough to be applicable to a large number of different HPC applications. The logical application model interface is important as it allows state information to flow from applications to the platform and its management components, like schedulers and accounting services. As a first approximation, we recognize the following logical application states:

1. Running: executing normally
2. Checkpointing: checkpointing current state
3. Restoring: restoring from a checkpoint
4. Idle: waiting for an external event
5. Error: in a terminal state of failure

[Diagram: compute nodes running native jobs (J) and containers (C) side by side on the node OS with LXC and virtual networking, connected via a message bus (Apache Kafka) to a metrics database, an analytics engine, a container manager, the cHPC API, and the job scheduler (SLURM).]

Figure 2: The cHPC architecture combines existing HPC platforms with LXC and platform analytics.

In addition to the application states, the logical application model should have an optional notion of an application's relative progress. With application state and progress information it is possible to make basic assumptions about the internal state of the application. It can help the platform to (a) track the progress of an application, (b) determine a preferable time for application interruption (e.g. after checkpoints), and (c) make decisions about resource (re-)assignment and QoS by observing I/O-, compute-centric and idle states.
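A minimal logical state interface along these lines might be modelled as follows. The state names mirror the list above; everything else (class and method names, the 0.0-1.0 progress convention) is our own assumption for illustration.

```python
from enum import Enum
from typing import Optional

class LogicalState(Enum):
    RUNNING = "running"              # executing normally
    CHECKPOINTING = "checkpointing"  # checkpointing current state
    RESTORING = "restoring"          # restoring from a checkpoint
    IDLE = "idle"                    # waiting for an external event
    ERROR = "error"                  # terminal state of failure

class LogicalStateReporter:
    """What an application would publish to the platform."""
    def __init__(self) -> None:
        self.state = LogicalState.RUNNING
        self.progress: Optional[float] = None  # optional, clamped to 0.0-1.0

    def report(self, state: LogicalState,
               progress: Optional[float] = None) -> None:
        self.state = state
        if progress is not None:
            self.progress = min(max(progress, 0.0), 1.0)

    def interruptible(self) -> bool:
        # A scheduler might prefer to interrupt around checkpoints
        # or while the application is idle.
        return self.state in (LogicalState.CHECKPOINTING, LogicalState.IDLE)
```

A scheduler observing such reports could, for instance, defer preemption until `interruptible()` holds, implementing point (b) above.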

Platform Environment Model and Interface

The third and last sub-model that comprises our concept of a more symmetric HPC platform model is the platform environment model. Analogous to the logical and physical application models, the platform environment model captures the state and properties of the platform resources (hardware) and management services (schedulers, QoS) with respect to their relevance for implementing resilience and optimization mechanisms. The accompanying interface should again allow pull-based real-time, full-model extraction as well as push-based subscription/notify access.

The platform environment model interface is relevant for the platform management software as well as applications. For the platform, it provides self-introspection; for the application, it provides environmental awareness. As a minimum, we recommend to capture the following per-node utilization metrics: (1) CPU, (2) memory, (3) network I/O bandwidth, (4) filesystem I/O bandwidth, and (5) storage. Platform environment states reflect the platform state with respect to the application:

1. Draining: resources are drained to prepare application termination.
2. Terminating: the application is being terminated.
3. Adjusting: resources are temporarily reduced or increased.
4. Freezing: application resources are temporarily withdrawn.
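Putting the per-node metrics (1)-(5) and the platform states together, a platform environment snapshot might be shaped like the following hypothetical sketch; the query method is an example of pull-based access to the full model, not part of any existing interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class PlatformState(Enum):
    DRAINING = "draining"        # preparing application termination
    TERMINATING = "terminating"  # application is being terminated
    ADJUSTING = "adjusting"      # resources reduced or increased
    FREEZING = "freezing"        # resources temporarily withdrawn

@dataclass
class NodeUtilization:
    """Per-node utilization metrics (1)-(5)."""
    cpu_percent: float
    memory_percent: float
    net_io_mb_per_s: float
    fs_io_mb_per_s: float
    storage_percent: float

@dataclass
class PlatformEnvironment:
    """Pull-based snapshot of the platform environment model."""
    nodes: Dict[str, NodeUtilization]
    state: PlatformState

    def overloaded_nodes(self, cpu_threshold: float = 90.0) -> List[str]:
        # Example full-model query an application or scheduler might run.
        return [n for n, u in self.nodes.items()
                if u.cpu_percent > cpu_threshold]
```

An application receiving an `ADJUSTING` state together with such a snapshot gains the environmental awareness described above and can react before its resources change.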

4.2 I/O Resource and Storage Scheduling

In order to address the requirements for a predictable and stable execution environment for applications that work with large and dynamic datasets, we recommend making I/O and storage first-order resources in future HPC platform models. We suggest that network and filesystem bandwidth, as well as storage capacity, become reservable entities, just like CPU and memory. It should be possible for an application to reserve a specific amount of storage space and a guaranteed filesystem and inbound/outbound network I/O bandwidth. Schedulers would take these additional requests into account.
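An extended resource request treating I/O and storage as first-order, reservable resources could then look like the sketch below. The field names and the admission check are our own illustration; no current batch scheduler accepts such a request.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """A job request with I/O and storage as reservable resources."""
    cores: int
    memory_gb: int
    # Newly reservable entities alongside CPU and memory:
    storage_gb: int             # guaranteed scratch capacity
    fs_bandwidth_mb_per_s: int  # guaranteed filesystem throughput
    net_in_mb_per_s: int        # guaranteed inbound bandwidth
    net_out_mb_per_s: int       # guaranteed outbound bandwidth

def fits(request: ResourceRequest, free: ResourceRequest) -> bool:
    """A scheduler admits the job only if every reservation fits
    within the currently free capacity."""
    return all(getattr(request, f) <= getattr(free, f)
               for f in request.__dataclass_fields__)
```

With such requests, a scheduler could refuse or delay a data-intensive job when filesystem bandwidth, not CPU, is the scarce resource.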

4.3 Isolated, User-Driven Deployment

Lastly, to address the issues of software and application deployment and mobility (migratability), we recommend that future HPC platform models embrace a user-driven software deployment approach and isolated software environments, to provide a more hospitable environment for second generation HPC applications and, equally important, a sustainable environment for legacy applications. Legacy applications can be equally affected when they slowly grow incompatible with centrally managed libraries and compilers.

5. IMPLEMENTATION

In this section we provide a brief outline of cHPC (container HPC), our early prototype implementation of an HPC platform architecture driven by the recommendations in section 4. We give a high-level overview of its architecture and implementation and discuss how it complements existing platform models, i.e., allows a non-disruptive, incremental evolution of production HPC platforms.

5.1 The cHPC Platform

To explore the implementation options for our new platform model, we have developed cHPC, a set of operating-system level services and APIs that can run alongside and integrate with existing job schedulers via Linux containers (LXC) to provide isolated, user-deployed application environment containers, application introspection, and resource throttling via the cgroups kernel extension. The LXC runtime and software-defined networking are provided by Docker 1 and run as OS services on the compute nodes. Applications are submitted via the platform's "native" job scheduler (in our case SLURM), either as regular HPC jobs that are launched as processes on the cluster nodes, or as containers that run supervised by LXC. This architecture allows HPC jobs and container applications to run side-by-side on the same platform, which allows direct comparison of their performance and overhead.

To provide platform and application introspection (see Section 4.1), container and node metrics are collected in real time via a high-throughput, low-latency message broker based on Apache Kafka 2 and streamed via the platform API service to one or more consumers. These can be a user, a monitoring dashboard, or the application itself. The data is also ingested into a metrics database for further processing and deferred retrieval. The purpose of the analytics engine is to compare the stream of platform data with the thresholds set by the applications (Section 4.1) through the platform API and to send an alarm signal to all subscribers when a threshold is violated.
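For illustration, a metric sample travelling over the message bus could be a small, self-describing JSON record like the one produced below. The exact schema used by cHPC is not specified here, so all field names are assumptions.

```python
import json
import time

def metric_sample(node: str, container: str,
                  metric: str, value: float) -> str:
    """Serialize one sample as it might be published to the broker."""
    return json.dumps({
        "ts": time.time(),       # sample timestamp
        "node": node,            # originating compute node
        "container": container,  # container (or job) identifier
        "metric": metric,        # e.g. "fs_io_mb_per_s"
        "value": value,
    })

def parse_sample(raw: str) -> dict:
    """Consumers (dashboard, analytics engine, ...) decode samples."""
    return json.loads(raw)
```

A self-describing record keeps producers and the various consumers (metrics database, analytics engine, dashboards) loosely coupled, which matters when they evolve independently.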

A version of cHPC has been deployed on a virtual 128-core SLURM cluster on 8 dedicated Amazon EC2 instances. We are in the process of deploying cHPC on EPCC's 3 24-node, 1536-core INDY cluster for further evaluation.

1 Docker: https://www.docker.com
2 Apache Kafka: https://kafka.apache.org
3 Edinburgh Parallel Computing Centre: https://www.epcc.ed.ac.uk


5.2 Practical Applicability

We believe that it is important to find a practical and applicable way forward when suggesting any changes to the HPC platform model. Users and platform providers should benefit from it alike, and it should not disrupt existing applications and usage modes. Our implementation blueprint fulfills these requirements. It is designed to be explicitly non-disruptive, the most critical aspect of its real-world applicability. Suggesting a solution that would disrupt the existing platform and application ecosystem would be, even though conceptually valuable, not relevant for real-world scenarios. The key asset is the non-invasiveness of Docker's operating-system virtualization. It is designed as an operating-system service accompanied by a set of virtual devices. Assuming a recent version of the operating system kernel, it can be installed on the majority of Linux distributions. Additional rules or plug-ins can be added without disruption to existing job managers, to allow them to launch container applications through regular job description files. The remaining components (introspection API services, metrics database, and analytics engine) can run on external utility nodes.

This setup allows us to run containerized applications alongside regular HPC jobs on the same platform. This is suboptimal from the perspective of adaptive resource management and global platform optimization, as regular HPC jobs cannot be considered for optimization by our system. However, it is a viable way forward towards a possible next incremental step in HPC platform evolution. It also achieves our main goal: to provide a more suitable, alternative platform for applications that have difficulties existing within the current HPC platform model and policies.

6. THE ROAD AHEAD

Many of the platform model issues we discuss in this paper are based on our own experience as well as the experience of our immediate colleagues and collaborators. Platform providers, too, seem to have a heightened awareness of these issues. In order to qualify and quantify our assumptions, we are in the process of designing a survey that will be sent out to platform providers and application groups to verify current issues on a broader and larger scale. The main focus of our work will be on the further evaluation of our prototype system. We are working on a "bare metal" deployment on HPC cluster hardware at EPCC. This will allow us to carry out detailed measurements and benchmarks to analyze the overhead and scalability of our approach. We will also engage with computational science groups working on second generation applications to explore their real-life applications in the context of cHPC.
