March 10, 2011, Aachen, Germany

System Software for Exascale Platforms

Barney Maccabe
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Friday, March 11, 2011

Exascale Systems Research Activities at ORNL

• Exascale Research Kickoff Meeting
  • San Diego, California; March 7-10
  • http://exascaleresearch.labworks.org/ascr2011
• Scott Klasky
• Jeff Vetter
• David Bernholdt
• Stephen Scott



In-situ Data Reduction and Analysis for Extreme Scale Science

Scott Klasky, ORNL
Podhorszki, Samatova, Wolf, ORNL; Shoshani, Wu, LBNL; Schwan, GT
FWP # ERKJU60

Objectives
• Create a robust I/O staging framework for in situ analysis and reduction of extreme-scale application data
• Create and extend programming models for end scientists to utilize in situ I/O stream processing
• Create a toolkit of I/O modules, such as FastBit indexing, that scientists can easily utilize

Approach
• We leverage existing multi-year efforts in software tools for scientists
  • ADIOS for I/O abstraction in code, DataTap from GT for the staging area, FastBit from LBNL for indexing, Parallel R for analysis
• We will provide run-time parameterization of ADIOS methods through a control layer
• We will introduce a PGAS-like interface for staging area processing to simplify it for scientists

Impact
• ADIOS is already used by many extreme-scale science codes
• Extensions to ADIOS will be immediately inherited
• Provide future-proofing for scalable I/O needs
• Initial evaluation systems already deployed for codes running at over 100k cores (fusion, combustion)

Figure: Scenario of in situ I/O pipeline analysis and visualization for fusion simulation data
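The staging idea in this project summary can be sketched as a stream of per-step data that is reduced and indexed before anything reaches storage. This is a minimal illustration only: the function names (`simulate`, `reduce_chunk`, `build_index`, `staging_pipeline`) and the threshold-based index are hypothetical, and the code does not use the ADIOS, DataTap, or FastBit APIs.

```python
from statistics import mean


def simulate(steps, cells):
    """Stand-in for a simulation: yields (step, values) for each output step."""
    for step in range(steps):
        yield step, [float((step + 1) * (i % 7)) for i in range(cells)]


def reduce_chunk(values):
    """In situ reduction: keep summary statistics instead of the raw data."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def build_index(values, threshold):
    """Toy index: positions of the cells whose value exceeds a threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def staging_pipeline(source, threshold):
    """Apply the reduction and indexing operators to each chunk as it streams by."""
    for step, values in source:
        yield {
            "step": step,
            "summary": reduce_chunk(values),
            "hot_cells": build_index(values, threshold),
        }


results = list(staging_pipeline(simulate(steps=3, cells=14), threshold=10.0))
```

Because the pipeline is built from generators, each step's raw values can be dropped as soon as its summary and index have been produced, which is the point of reducing data in the staging area.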


Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems

Jeffrey Vetter, ORNL
Robert Schreiber, HP Labs
Trevor Mudge, University of Michigan
Yuan Xie, Penn State University
FWP # ERKJU59

Objectives
• Propose new distributed computer architectures that address the resilience, energy, and performance requirements of future DOE exascale systems:
  • replace mechanical-disk-based data stores with energy-efficient non-volatile memories;
  • place low-power compute cores close to the data store;
  • reduce the number of levels in the memory hierarchy.
• Evaluate the impact of the proposed architectures on the performance of critical DOE applications.

Approach
• Identify and evaluate the most promising non-volatile memory (NVM) technologies.
• Explore assembly of NVM technologies into a storage and memory stack.
• Propose an exascale HPC system architecture that builds on our new memory architecture.
• Build the abstractions and interfaces that allow software to exploit NVM to its best advantage.
• Characterize key DOE applications and investigate how they can benefit from these new technologies.

Impact
• Address energy scalability of future exascale systems; NV memories have zero standby power.
• Increase system reliability; MRAM/PCRAM are resilient to soft errors.
• Develop new programming models that exploit NVMs to improve the fault tolerance of applications.

Figure: A comparison of various memory technologies


Vancouver: A Software Stack for Productive Heterogeneous Exascale Computing

Jeffrey Vetter, ORNL
Wen-Mei Hwu, UIUC
Allen Malony, University of Oregon
Rich Vuduc, Georgia Tech
ERKJU44

Objectives
• Enhance programmer productivity for the exascale
  • Increase code development ROI by enhancing code portability
  • Decrease barriers to entry with new programming models
  • Create next-generation tools to understand the performance behavior of an exascale machine

Approach
• Programming tools
  • GAS programming model
  • Analysis, inspection, transformation
• Software libraries: autotuning
• Runtime systems: scheduling
• Performance tools
• Impact on DOE Applications

Impact
• Reduced application development time
• Ease of porting applications to heterogeneous systems
• Increased utilization of hardware resources and code portability

Figure: The proposed Maestro runtime simplifies programming heterogeneous systems by unifying OpenCL task queues into a single high-level queue.
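The slide describes Maestro as unifying per-device OpenCL task queues into a single high-level queue. The sketch below illustrates that idea in plain Python. The class names (`DeviceQueue`, `UnifiedQueue`), the cost estimates, and the least-loaded placement policy are all assumptions made for illustration; they are not Maestro's design and no OpenCL calls are used.

```python
from collections import deque


class DeviceQueue:
    """Stand-in for one per-device task queue."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()
        self.load = 0.0  # total estimated cost of the queued tasks

    def push(self, task, cost):
        self.tasks.append(task)
        self.load += cost


class UnifiedQueue:
    """Single submission point that hides the per-device queues."""

    def __init__(self, devices):
        self.devices = list(devices)

    def submit(self, task, cost=1.0):
        # Assumed policy: send the task to the least-loaded device.
        target = min(self.devices, key=lambda d: d.load)
        target.push(task, cost)
        return target.name


gpu0, gpu1, cpu = DeviceQueue("gpu0"), DeviceQueue("gpu1"), DeviceQueue("cpu")
unified = UnifiedQueue([gpu0, gpu1, cpu])
costs = [4.0, 1.0, 1.0, 2.0, 1.0]
placement = [unified.submit(f"kernel{i}", c) for i, c in enumerate(costs)]
```

The application only ever calls `submit`; which device runs a task is decided by the runtime, which is what makes porting and load balancing easier.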


COMPOSE-HPC: Software Composition for Extreme Scale Computational Science and Engineering

David E. Bernholdt, ORNL
Tom Epperly, LLNL
Manoj Krishnan, PNNL
Matt Sottile, Galois
Rob Armstrong, SNL
ERKJU68

Objectives
• Develop a flexible, extensible toolkit to help software developers address various kinds of software composition challenges
• Provide examples of how the toolkit can be applied to specific composition problems
• Develop new approaches to facilitate composition of parallelism (threads and processes)

Approach
• Develop the Knot Nimble Orchestration Toolkit (KNOT), consisting of three components:
  • An annotation parsing facility (PAUL) to interpret guiding annotations embedded as comments in user source code
  • A transformation facility (ROTE) to apply source-to-source transformations to the code, based on annotations and other inputs
  • A code generation facility (BRAID) capable of manipulating compiler-like intermediate representations, transforming, optimizing, and generating source code based on the results
• Develop examples of using KNOT to address different composition problems: language interoperability, contract enforcement, automatic performance instrumentation, data marshalling for GPUs
• Build on this infrastructure to facilitate the composition of parallelism
  • Allow codes to express their preferred threaded execution model, and call modules with different threading models
  • Simplify the expression and exploitation of MPMD parallelism

Impact
• Practical tools to help software developers produce higher quality code that works effectively in modern HPC environments
• Bring the capabilities of code transformation and code generation to bear on the challenges of software composition

Figure: A schematic of the KNOT tool chain
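The first KNOT stage interprets annotations embedded as comments in user source code. A rough sketch of that step follows. The annotation syntax (`%TOOL key=value ...` after a `//`, `#`, or `!` comment marker) is invented here for illustration and is not the actual PAUL syntax; `parse_annotations` is likewise a hypothetical name.

```python
import re

# Hypothetical syntax: a comment marker, then %TOOL, then key=value pairs.
ANNOTATION = re.compile(r"(?://|#|!)\s*%(\w+)\s+(.*)$")


def parse_annotations(source):
    """Return one record per annotated comment line in the source text."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            tool, body = match.groups()
            attrs = dict(
                pair.split("=", 1) for pair in body.split() if "=" in pair
            )
            found.append({"line": lineno, "tool": tool, "attrs": attrs})
    return found


example = """\
// %CONTRACT requires=n>0 ensures=result>=0
double norm(const double *x, int n);
! %INSTRUMENT timer=solve
      call solve(a, b)
"""
annotations = parse_annotations(example)
```

Because the annotations live in comments, the unmodified compiler ignores them, while a later transformation stage could use the parsed records to decide what to rewrite.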


Enabling Exascale Hardware and Software Design Through Scalable System Virtualization

Stephen L. Scott, ORNL
Patrick Bridges, UNM
Peter Dinda, NWU
Kevin Pedretti, SNL
ERKJU70

Objectives
• Provide a novel solution for testing at scale
• Ease the transition to production by supporting scaling of legacy system software
• Enable advanced research
  • Architecture research toward exascale
  • New parallel programming models
  • System software research
  • Resilience

Approach
• Extend the Kitten/Palacios prototype
  • Support for modern hardware
  • Port to HPC operating systems
  • Integration of system management tools
• Design & implement new capabilities
  • Integration with micro-architectural simulator
  • Binary translation for the emulation of new hardware
  • Fault injection

Impact
• Provide a test bed solution for exascale
  • Vertical profiling
  • Fault injection
• Provide a platform for exascale research
  • System architecture research
  • Programming languages research
  • System software research
  • Resilience

Figure: Kitten/Palacios Architecture: Virtualization for Exascale


Operating Systems Research: Collaboration

• OS meeting held in Phoenix, Arizona, on January 19, 2011
• Participants
  • Pete Beckman (ANL)
  • Ron Brightwell (SNL)
  • Kamil Iskra (ANL)
  • Larry Kaplan (Cray)
  • Barney Maccabe (ORNL)
  • Ron Minnich (SNL)
  • Marc Snir (UIUC)
  • Bob Wisniewski (IBM)


What is Operating Systems Research?

• Management of protection and capabilities
  • isolation is the key challenge
• First level handling of “rule breaking”
  • interrupts, traps, exceptions, and faults
• Common mechanisms and services:
  • Resource management mechanisms – not necessarily policies
  • Instrumentation/introspection
  • Reliable and scalable communication and control protocols
  • External communication
  • …


Driving Forces

• Concurrency
  • O(1B) threads
  • O(1k-10k) threads per node
• Heterogeneity
  • Different types of cores
  • Non-coherent shared memory
  • Deeper memory hierarchies
• Energy constraints and power management
• Evolving balance between compute, memory, and communication
• More frequent failures
• More complex applications
  • Dynamic, data-dependent algorithms; multiscale, multiphysics
• More complex software
  • Python, C++, Fortran, OpenMP, MPI, libs, frameworks; run-time adaptation
• Legacy compatibility and toolchain support (TCP/IP…)
• Added HW functionality
  • E.g., better HW support for protection


Increasing importance of effectively managing increasingly complex resources

• Increasing complexity
  • Memory hierarchy becoming deeper
  • New technologies, e.g., SSD
  • Variety of computing resources
• Increasing importance
  • Parallelism increasing in cores/node and number of nodes
  • Scaling efficiency is even more critical
  • Strong scaling will become more important
  • Power management will be a real consideration
  • Resilience in the presence of component failures must be addressed


Overall Considerations

• Focus on tool developers, runtime/middleware/library writers, and subsystem developers (I/O, Viz), not application developers
• Sustainability is a key consideration
  • Needs to be accepted by vendors and community
• Flexibility to handle change
  • X86/PPC today; ARM tomorrow? (think about revolution)
• Co-design opportunities:
  • Customers: Tools, libraries, programming models, subsystems (I/O, Viz)
  • Suppliers: HW architecture


iOS 4 as an example of co-design

• Multitasking has potential to
  • drain battery life and
  • interfere with foreground application
• iOS 4 introduced multitasking
  • specialized services generally available:
    • complete tasks (downloads)
    • continue where you left off (games)
    • receive push and local notifications
  • general services specially available:
    • listen to music
    • run your GPS
    • get VOIP calls

We looked at 1000s of apps. This is what they needed.

“Hardware and software made for each other.”


Nominal Architecture

• Enclave: Transitory parallel job or persistent parallel subsystem (e.g., parallel file system, external gateways, monitoring subsystem)
• Little/no time sharing of resources – enclaves are non-interfering (performance, power management)
  • But different enclaves could reside on different cores of same node
• Specialized resources (performance, power)
• Global/enclave level/node level OSR (implemented via node level software) – more functionality than today (resilience, power, complex applications…)
• Enclaves could be hierarchical

[Diagram: a System containing Enclaves, each Enclave containing Nodes]
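The enclave rules on this slide (no shared resources between enclaves, different enclaves allowed on different cores of one node, optional nesting) can be written down as a small data structure. The sketch below is an illustration under stated assumptions: the `Enclave` class, the `(node, core)` tuples, and the `non_interfering` check are hypothetical names, not part of any system described in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Enclave:
    name: str
    cores: set = field(default_factory=set)       # (node, core) pairs owned directly
    children: list = field(default_factory=list)  # nested enclaves

    def all_cores(self):
        """Cores owned by this enclave and by every enclave nested inside it."""
        owned = set(self.cores)
        for child in self.children:
            owned |= child.all_cores()
        return owned


def non_interfering(enclaves):
    """True if no core is shared between any two of the given enclaves."""
    seen = set()
    for enclave in enclaves:
        cores = enclave.all_cores()
        if seen & cores:
            return False
        seen |= cores
    return True


job = Enclave("job", cores={(0, 0), (0, 1), (1, 0)})
file_system = Enclave("pfs", cores={(0, 2), (2, 0)})
monitor = Enclave("monitor", cores={(0, 1)})
workflow = Enclave("workflow", children=[job, file_system])
```

Here `job` and `file_system` both use node 0 but on different cores, so they do not interfere, while `monitor` collides with `job` on core `(0, 1)`.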


Evolutionary vs Revolutionary

• Both evolutionary and revolutionary approaches are expected to be necessary.
• Evolutionary: low-risk research/improvement of existing technology areas based on past/current experiences and utilizing known techniques
  • Might get us to the “gen 1” system
• Revolutionary: higher-risk research in new areas, requiring more innovative solutions.
  • Expected to be required for a successful “gen 2” system
• Note: either approach is meant to result in production-quality software.


Evolutionary Track

• Traditional vendor-supplied hypervisor/kernel on each node for setup and “legacy services”
• Run-time services provided either by libraries (e.g., thread scheduling) or by kernel
  • Modified Linux or LWK
• Added kernel functionalities:
  • Contiguous memory, large pages, improved scheduling, I/O forwarding
• All global functions (resource manager, control infrastructure for tools,…) built atop standard distributed computing protocols (TCP/IP, sockets, rsh, kerberos…)


Revolutionary Track

• Vendor kernel/hypervisor SW used for node bootstrapping
  • Limited to a minority of cores (the system cores)
  • May include a traditional Linux kernel for legacy services
• Lightweight runtime system with limited functionality runs on the majority of cores (the compute cores)
  • Synchronization, communication, lightweight thread scheduling
  • Offloads complex operations to the system cores
• Global services can depend on new HW capabilities:
  • New communication protocols (rDMA, global communication)
  • More sophisticated protection HW (more rings, capability-carrying packets, privileged RMI, etc.)
• Global services can be layered atop global communication services
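The split between compute cores and system cores can be modeled as a request queue: the lightweight runtime forwards anything complex and waits for the reply. The sketch below uses a Python thread as a stand-in for a system core. The names `system_core` and `offload` are hypothetical, `sorted` is only a placeholder for a "complex operation", and error handling is deliberately omitted.

```python
import queue
import threading


def system_core(requests):
    """Stand-in for a system core: serves offloaded operations until told to stop."""
    while True:
        item = requests.get()
        if item is None:  # shutdown marker
            break
        operation, args, reply = item
        reply.put(operation(*args))


def offload(requests, operation, *args):
    """Called on a compute core: forward the operation and block for its result."""
    reply = queue.Queue(maxsize=1)
    requests.put((operation, args, reply))
    return reply.get()


requests = queue.Queue()
worker = threading.Thread(target=system_core, args=(requests,))
worker.start()

# A compute core hands off work it does not implement itself.
result = offload(requests, sorted, [3, 1, 2])

requests.put(None)
worker.join()
```

The compute-side code path stays tiny (enqueue, wait), which is the property a lightweight runtime with limited functionality is after.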


Example of Global Communication Services

• Pub-sub for failure events
  • Failures can be reflected back to enclave RT
  • Out-of-band monitoring infrastructure has more than one consumer!
  • Includes passive monitoring and active monitoring (e.g., heartbeats)
• Pub-sub for performance sensors (including access to HW sensors – energy, temperature) and instantiation and control of SW sensors
• Reliable & scalable command and control infrastructure
  • Resource manager, debugger, run-time…
• Parallel coupling and communication protocols between enclaves (workflows, I/O, multi-module codes)
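A publish-subscribe service for failure events with heartbeat-based active monitoring might look like the sketch below. It is a single-process illustration, not a scalable implementation: `EventBus`, `HeartbeatMonitor`, the topic name `node.failure`, and the 5-second timeout are all assumptions. Time is passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish-subscribe hub: any number of consumers per topic."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self.subscribers[topic]:
            callback(event)


class HeartbeatMonitor:
    """Active monitoring: a node that stays silent too long is reported once."""

    def __init__(self, bus, timeout):
        self.bus = bus
        self.timeout = timeout
        self.last_seen = {}
        self.failed = set()

    def heartbeat(self, node, now):
        self.last_seen[node] = now
        self.failed.discard(node)

    def check(self, now):
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout and node not in self.failed:
                self.failed.add(node)
                self.bus.publish("node.failure", {"node": node, "detected_at": now})


bus = EventBus()
runtime_log, resource_manager_log = [], []
bus.subscribe("node.failure", runtime_log.append)  # enclave runtime
bus.subscribe("node.failure", lambda e: resource_manager_log.append(e["node"]))

monitor = HeartbeatMonitor(bus, timeout=5.0)
monitor.heartbeat("node0", now=0.0)
monitor.heartbeat("node1", now=0.0)
monitor.heartbeat("node0", now=4.0)
monitor.check(now=6.0)
```

Both subscribers receive the same failure event, which reflects the slide's point that the monitoring infrastructure has more than one consumer.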


Node Concurrency Challenges

• High cost of synchronization/on-node communication
  • Inefficient support for producer-consumer synchronization and collective synchronization
• Move towards NUMA requires careful mapping
  • Affinity to CPU and memory
• Heterogeneous cores
• Co-design:
  • Synchronization HW
  • Virtualization (in the sense of hiding fixed amount of HW)
  • Same or different ISAs on heterogeneous cores?
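CPU affinity, one half of the "affinity to CPU and memory" mapping above, can be demonstrated with the scheduler-affinity calls that Python exposes on Linux. The helper name `run_pinned` is hypothetical. The sketch covers CPU placement only; memory placement on a NUMA system needs separate mechanisms that are not shown. On platforms without `os.sched_setaffinity` the function simply runs unpinned.

```python
import os


def run_pinned(core, func):
    """Run func with the calling process restricted to one core, then restore."""
    if not hasattr(os, "sched_setaffinity"):
        return func()  # platform without this interface: run unpinned
    original = os.sched_getaffinity(0)
    # Fall back to an allowed core if the requested one is unavailable.
    target = core if core in original else min(original)
    os.sched_setaffinity(0, {target})
    try:
        return func()
    finally:
        os.sched_setaffinity(0, original)


answer = run_pinned(0, lambda: sum(range(1000)))
```

Restoring the original mask in `finally` keeps the pinning scoped to the one call, so later work is not accidentally confined to a single core.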


Node Concurrency Technologies

• Evolutionary
  • Improved synchronization/IPC
  • Advanced affinity
• Revolutionary
  • Tight coupling with new HW mechanisms for
    • synchronization
    • event/message driven thread scheduling, e.g., “active messages” (user-level interrupt table?)
  • Support for heterogeneous cores
  • Support for asymmetric kernel


Memory Challenges & Opportunities

• Inappropriate support for large working sets
• NUMA
• Non-coherent shared memory
• Integration of NV-RAM in memory hierarchy
  • Support for “out of core” codes, e.g., explicit migration to NV-RAM?
• Leveraging of 64 bit address space
• Integration of remote memory


Memory Technologies

• Evolutionary
  • NUMA localized memory management
  • Management of coherence domains
  • Support for contiguous memory regions
    • if required by hardware
  • Very large pages
    • 2 MB “huge” pages are tiny
    • transparently accessible to user-space
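The remark that 2 MB "huge" pages are tiny is easy to check with arithmetic: count how many pages are needed to map a large working set at each page size. The 64 GiB working set and the 1 GiB page size below are assumed example values, not figures from the talk.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30

PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": 1 * GIB}


def pages_needed(working_set, page_size):
    """Number of pages required to map the working set (rounded up)."""
    return -(-working_set // page_size)


WORKING_SET = 64 * GIB  # assumed per-node working set

page_counts = {
    label: pages_needed(WORKING_SET, size) for label, size in PAGE_SIZES.items()
}

for label, count in page_counts.items():
    print(f"{label:>6} pages: {count:>10,} mappings")
```

Even with 2 MiB pages, this working set still needs tens of thousands of mappings, which is why the slide asks for very large pages that are transparently accessible to user space.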


Memory Technologies (cont’d)

• Revolutionary
  • Incorporating NV-RAM
    • turn DRAM into an L4 cache?
    • make NV-RAM a swap space for paging?
  • Get rid of paging? Go back to Base & Bound?
• Exploratory
  • Opportunities of a very large address space
    • map memory from all nodes?
  • Non-traditional memory management
    • non-contiguous heap
    • memory regions with virtual = physical addressing
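For readers unfamiliar with the Base & Bound scheme the slide proposes revisiting: each address space is one contiguous region, and translation is a range check plus an addition, with no page tables. A minimal model follows; the class names and the example addresses are assumptions for illustration.

```python
class BoundViolation(Exception):
    """Raised when a virtual address falls outside the segment."""


class Segment:
    """One contiguous region described by a base address and a length (bound)."""

    def __init__(self, base, bound):
        self.base = base
        self.bound = bound

    def translate(self, vaddr):
        # Protection check: a single comparison against the bound.
        if not 0 <= vaddr < self.bound:
            raise BoundViolation(f"address {vaddr:#x} outside bound {self.bound:#x}")
        # Relocation: a single addition, no page-table walk.
        return self.base + vaddr


segment = Segment(base=0x4000_0000, bound=0x1000_0000)
physical = segment.translate(0x1234)
```

The appeal for HPC nodes is that translation needs no TLB and memory is physically contiguous; the cost is that the region cannot be grown or shared as flexibly as paged memory.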


Resilience Challenges

• Has different time scales
  • Microseconds: uncorrectable data errors, node loss, etc.
  • Months: want to revive an app and rerun for some reason
• Today we do the same for all ranges:
  • Global checkpoint/restart
• Works because uptimes are so good (> 1 week)
• Because of potential (energy) cost of low MTBF, we need mechanisms for alternative approaches
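Why global checkpoint/restart stops working as MTBF drops can be illustrated with Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). This model is not from the talk; it is brought in here as a standard rough estimate, and it is only accurate when the checkpoint cost is much smaller than the MTBF. The 10-minute checkpoint cost and the two MTBF values are assumed example numbers.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(checkpoint_cost, mtbf, interval):
    """First-order fraction of time lost: checkpoint writes plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


WEEK = 7 * 24 * 3600.0
HOUR = 3600.0
COST = 600.0  # assumed: 10 minutes to write one global checkpoint

for label, mtbf in [("1 week", WEEK), ("1 hour", HOUR)]:
    interval = young_interval(COST, mtbf)
    lost = overhead_fraction(COST, mtbf, interval)
    print(f"MTBF {label}: checkpoint every {interval / 60:.0f} min, "
          f"about {lost:.0%} of time lost")
```

Under these assumptions the estimated loss stays below 5% with a one-week MTBF but exceeds half of the machine time with a one-hour MTBF, which is the regime where alternative mechanisms become necessary.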


Mechanisms for Resilience

• Checkpointing at various granularities
  • Thread/node/enclave
• Various levels of storage persistence
  • more/less information dispersal; more/less redundancy
  • atop NV-RAM and/or disk
  • Persistence hierarchy
• Logging services
• Various levels of communication reliability
• Scalable & reliable pub-sub infrastructure for failure events (HW, timeouts, run-time defined checks…)
• Scalable & reliable coordination protocols
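A persistence hierarchy pairs cheap, frequent checkpoints that survive only small failures with expensive, rare ones that survive large failures. The sketch below shows how a restart point could be chosen. The three levels, their intervals, and the failure classes they survive are assumptions chosen for illustration.

```python
# (level name, checkpoint interval in steps, failure classes the level survives)
LEVELS = [
    ("dram", 1, {"process"}),
    ("nvram", 4, {"process", "node"}),
    ("disk", 16, {"process", "node", "system"}),
]


def latest_checkpoint(step, interval):
    """Most recent step at or before `step` at which this level checkpointed."""
    return (step // interval) * interval


def restart_step(failure, fail_step):
    """Newest checkpoint among the levels that survive the given failure class."""
    candidates = [
        latest_checkpoint(fail_step, interval)
        for _name, interval, survives in LEVELS
        if failure in survives
    ]
    return max(candidates)


restarts = {kind: restart_step(kind, fail_step=23)
            for kind in ("process", "node", "system")}
```

For a failure at step 23, a process failure loses no completed steps, a node failure rolls back to step 20, and a system-wide failure rolls back to step 16, so the common small failures no longer pay the cost of a global restart.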


Resilience Technologies

• Evolutionary:
  • Checkpointing run-time (with user/compiler/runtime help)
  • Differentiated communication reliability levels
  • global services built atop TCP/IP
• Revolutionary:
  • Integration of in-band and out-of-band error detection and signaling
  • General pub-sub infrastructure for failure events
  • Differentiated storage reliability services


Power Technologies

• Goal: Promote power to a first class resource
  • alongside CPU and memory
• Need power allocation infrastructure (at enclave and node level)
• Need power sensor information reflected to different resource management layers
• Need new HW capabilities
  • low-overhead suspend/resume of cores (ACPI too expensive)
    • Measured in cycles, not seconds
  • Notification of power profile changes
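Treating power as an allocatable resource at both enclave and node level could work like the two-level budget split sketched here. The proportional scaling policy, the function name `allocate`, and all wattage figures are assumptions; the slide only states that such an infrastructure is needed.

```python
def allocate(budget, requests):
    """Grant every request in full if the budget allows, else scale all equally."""
    total = sum(requests.values())
    if total <= budget:
        return dict(requests)
    scale = budget / total
    return {name: watts * scale for name, watts in requests.items()}


# Level 1: the system budget is divided among enclaves.
SYSTEM_BUDGET = 800.0  # watts, assumed
enclave_requests = {"job": 600.0, "pfs": 200.0, "monitor": 200.0}
enclave_caps = allocate(SYSTEM_BUDGET, enclave_requests)

# Level 2: each enclave divides its cap among its nodes.
job_node_requests = {"node0": 300.0, "node1": 300.0}
node_caps = allocate(enclave_caps["job"], job_node_requests)
```

The same function serves both levels, so the caps computed for an enclave become the budget for its nodes, mirroring how CPU or memory quotas are handed down a hierarchy.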


Barney’s pet peeve: Linux (we’re avoiding the hard conversation)


Linux is the dominant OS on the Top 500

• Lots of things happening in the late 1990s
• Extreme Linux Forum
• Linux Cluster Institute
• LANL Linux clusters
• Sandia Cplant
• Los Lobos and the original Roadrunner at UNM


Linux has been a great compute node OS for clusters

• Why Linux?
  • Beowulf clusters: many people could build a supercomputer
  • Cplant, Extreme Linux Forum, Linux Cluster Institute
• At least it wasn’t NT!
  • Desktop, not network oriented
  • Unfamiliar/unsupported programming environment (limited tools)
• The goals of Linux and the needs of Exascale applications are divergent
• We need a real revolution

[Figure: page from the SC2000 tutorial program (www.sc2000.org/tutorials), listing among others “Design and Analysis of High Performance Clusters” by Robert Pennington, NCSA, and Patricia Kovatch, Barney Maccabe, David Bader, UNM, a tutorial on production NT and Linux superclusters]

Linux and the key Exascale challenges

• Memory
  • Unlikely to explore (or support) HPC uses of new technologies, e.g., SSD
  • refused support for big phys area
  • Zeptos removes the standard memory management
• Application resilience
  • only critical for extreme scale platforms
  • Why should Linux support resilience?
• Concurrency
  • Linux won’t dedicate a node to a single application
• Power issues may be supported

We’ll likely repeat the “OS Bypass” example

Enabling the Revolution

• The OS community needs to define an OS API for HPC
  • Zeptos and CNL don’t support most of the Linux API
    • did they throw away the same parts?
  • Linux is largely a starting point – what’s the destination?
    • Probably won’t include goofy calls like fork and exec
• This API needs to
  • Support HPC tools and runtime environments
  • Run easily on Linux clusters
    • Linux won’t go away and tool/runtime developers need to have the HPC OS API everywhere
  • Support a virtualization layer that is capable of running an operating system of your choice, including Linux
  • Support application migration
  • Emphasis on mechanism, not policy


Palacios Virtual Machine Monitor

• Collaboration between UNM, Northwestern, SNL, and ORNL
• Palacios handles all of the process/guest OS activities
• Current efforts focused on HPC communication layers

[Diagram: Palacios stack; labels: Bare Hardware, Host OS, Palacios, Guest OS, Linux, Kitten, HPC App]


Thanks

