Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Computer and Computational Sciences DivisionLos Alamos National Laboratory

Ideas that change the world

Achieving Usability and Efficiency in

Large-Scale Parallel Computing Systems

Kei Davis and Fabrizio Petrini{kei,fabrizio}@lanl.gov

Performance and Architectures Lab (PAL), CCS-3

Kei Davis and Fabrizio Petrini{kei,fabrizio}@lanl.gov Europar 2004, Pisa

Italy2

CCS-3

PALSchedule

09.00-09.45 Introduction 09.45-10.00 Break 10.00-10.45 Existing Systems 10.45-11.00 Break 11.00-11.45 Case Study 11.45-12.00 Break 12.00-12.45 A New Approach


Italy3

CCS-3

PALPart 1: Introduction

1. The need for more capability

2. The big issues

3. A taxonomy of systems in three dimensions


Italy4

CCS-3

PALThe Need for More Capability

The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible. Charles Babbage, 1791-1871

Our interest is in scientific computing—large-scale, numerical, parallel applications run on large-scale parallel machines.


Italy5

CCS-3

PALDefinitions

Computing capacity: total deliverable computing power from a system or set of systems. (Power—rate of delivery)

Computing capability: computing power available to a single application.

Highest-end computing is primarily concerned with capability—why else build such machines?


Italy6

CCS-3

PALThe Need for Large-Scale Parallel Machines

It is the insatiable demand for ever more computational capability that has driven the creation of many Tflop-scale parallel machines (Earth Simulator, LANL’s ASCI Q, LLNL’s Thunder and BlueGene/L, etc.)

Petaflop machines are on the horizon, for example DARPA HPCS program (High Productivity Computing Systems)


Italy7

CCS-3

PALOne-upmanship?

Is this merely one-upmanship with the Japanese?

From The Roadmap for the Revitalization of High-End

Computing, Computing Research Association:

[…] there is a growing recognition that a new set of scientific and engineering discoveries could be catalyzed by access to very-large-scale computer systems—those in the 100 teraflop to petaflop range.


Italy8

CCS-3

PALRequirements for ASCI

In our own arena, Advanced Simulation and Computing (ASC)

for stockpile stewardship; climate, ocean, and urban

infrastructure modeling, etc.,

Within 10 years, estimates of the demand for Capability and general physics arguments indicate a machine of 1000TF=1 PetaFlop (PF) will be needed to execute the most demanding jobs. Such demand is inevitable; it should not be viewed, however, as some plateau in required Capability: there are sound technical reasons to expect even greater Capability demand in the future.


Italy9

CCS-3

PALLarge Component Count

Increases in performance will be achieved through single processor improvements and increases in component count

For example, BlueGene/L will have 133,120 processors and 608,256 memory modules

The large component count will make any assumption of complete reliability unrealistic

133,120 processors 608,256 DRAM


Italy10

CCS-3

PALSensitivity to Failures

In a large-scale machine a failure of a single component usually causes a significant fraction of the system to fail because

1. Components are strongly coupled (e.g., a failure of a fan will lead to other failures due to overheating)

2. The state of the application is not stored redundantly, and loss of any state is catastrophic

3. In capability mode, many processing nodes are running the same application, and are tightly coupled together


Italy11

CCS-3

PALThe Need for Transparent Fault-Tolerance

System software must be resilient to failures, to allow continuing execution of in the presence of failures

Most of the investment is in the application software (250M$/year for MPI software in the ASCI TriLabs)

Economical constraints impose a limited level of redundancy

Other considerations include cost of development, scalability and efficiency


Italy12

CCS-3

PALThe JASON’s Report

A recent report from the JASON’s, a committee of distinguished scientists chartered by the US government, raised the sensitive question of whether ASCI machines can be used as capability engines

For that to be possible, major advances in fault-tolerance are needed

The recommendation of the report is to skip one generation of supercomputers, due to the lack of good technical/scientific solutions


Italy13

CCS-3

PALMTBF as a Function of System Size


Italy14

CCS-3

PALFailure Distribution (ASCI Blue Mountain)


Italy15

CCS-3

PALState of the Art in Large-Scale Supercomputers

We can assemble large-scale systems by wiring together hardware and “bolting together” software components

But we have almost no control on the machine: not only faults but also performance anomalies


Italy16

CCS-3

PAL1.2 The Big Issues

From DoE Office of Science

By the end of this decade petascale computers with thousands of times more computational power than any in current use will be vital tools for expanding the frontiers of science and for addressing vital national priorities. These systems will have tens to hundreds of thousands of processors, an unprecedented level of complexity, and will require significant new levels of scalability and fault management. [Emphasis added]


Italy17

CCS-3

PALOffice of Science cont’d

Current and future large-scale parallel systems require that such services be implemented in a fast and scalable manner so that the OS/R does not become a performance bottleneck.

Without reliable, robust operating systems and runtime environments the computational science research community will be unable to easily and completely employ future generations of extreme-scale systems for scientific discovery.


Italy18

CCS-3

PALDARPA

Defence Advanced Research Projects Administration (DARPA) High Productivity Computing Systems (HPCS) mission:

Provide economically viable high productivity systems for the national security and industrial user communities with the following design attributes in the latter part of this decade:

• Performance

•Programmability

•Portability

•Robustness


Italy19

CCS-3

PALOur Translation

Performance—achieving achievable performance (not, e.g., some percentage of theoretical peak)

Programmability/portability—standard interfaces, transparency of mechanisms for fault tolerance

Robustness—graceful failover


Italy20

CCS-3

PAL1.3 A Taxonomy of Systems

Q: Is it a supercomputer or just a cluster?

A: It is a continuum along multiple dimensions.

A taxonomy of systems of three dimensions: Degree of integration of compute node; Collective primitives provided by the network interface,

programmability, global address space; Degree of integration of system software.


Italy21

CCS-3

PALNote

This taxonomy is useful for our explication, but we make no claims that it

is canonical, that it captures highly specialized architectures (for

example custom-designed special-purpose digital processors, vector processors, floating-point processors).

We are concerned with the big `general purpose’ parallel machines.


Italy22

CCS-3

PALCompute Node

Degree of integration of compute node between

processors, memory, and network interface Single processor—SMP—multiple CPU cores per

chip Number of levels of cache, proximity of caches to

CPU core Proximity of network interface to CPU core: on-chip

—off-chip direct connection—separated by I/O interface


Italy23

CCS-3

PALNetwork Interface

Collective primitives provided by network interface: none—functionally rich;

Programmability of network interface: none—general purpose

Provision of virtual global address space


Italy24

CCS-3

PALSystem software

Degree of integration of system software

much more about this later…

Documents

Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems