salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/daly.pdf

Failure, Resilience, Opportunity and Innovation

John Daly, U.S. Department of Defense
Salishan High Speed Computing Conference

April 27, 2015

Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future

http://en.wikipedia.org/wiki/Tandem_Computers

Resilience at Scale

“Diversity jolts us into cognitive action in ways that homogeneity simply does not…. For this reason, diversity appears to lead to higher-quality scientific research.” - Scientific American, Volume 311, Issue 4, 2014.

“If you always do what you always did; you will always get what you always got.” - Albert Einstein

Innovation and Failure

https://www.youtube.com/watch?v=iJAq6drKKzE


Scientific – Bell Labs Relay Calculator

http://www.computerhistory.org/revolution/birth-of-the-computer/4/85/342


“Starting with the Model III delivered to the Armed Forces in 1944, not one of our customers has reported their computers giving out a wrong answer as the result of a machine error.”

- Second Symposium on Large Scale Digital Calculating Machinery, 1949.


Bi-quinary notation
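A minimal sketch of why bi-quinary notation lends itself to self-checking hardware: each decimal digit is held as one active line in a group of two ("bi") plus one active line in a group of five ("quinary"), so any single stuck or dropped line yields an invalid word. The bit ordering and the function names below are assumptions for illustration, not the Bell Labs relay layout.

```python
def encode_digit(d):
    """Encode a decimal digit as 7 bits: 2 'bi' bits followed by 5 'quinary' bits."""
    if not 0 <= d <= 9:
        raise ValueError("digit must be in 0..9")
    bi = [0, 0]
    bi[d // 5] = 1          # which half of the decade: 0-4 or 5-9
    quinary = [0] * 5
    quinary[d % 5] = 1      # position within that half
    return bi + quinary


def is_valid(word):
    """A word is valid only if exactly one bit is set in each group."""
    return (
        len(word) == 7
        and all(bit in (0, 1) for bit in word)
        and sum(word[:2]) == 1
        and sum(word[2:]) == 1
    )


def decode_digit(word):
    """Decode a word, refusing to return a value if the check fails."""
    if not is_valid(word):
        raise ValueError("machine error detected: invalid bi-quinary word")
    return 5 * word[:2].index(1) + word[2:].index(1)


def flip_bit(word, i):
    """Return a copy of word with bit i inverted (models a single fault)."""
    damaged = list(word)
    damaged[i] ^= 1
    return damaged
```

Flipping any one bit leaves a group with zero or two active lines, so every single-bit fault is detected rather than silently decoded as a different digit.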

Business – UNIVAC I (1951)
• 5200 vacuum tubes
• 29,000 lbs
• 125 kW
• 2.25 MHz clock
• 66 hours mean time to system failure

470 million instructions per hard stop

http://commons.wikimedia.org/wiki/File:UNIVAC-I-PRL61-0977.jpg

http://en.wikipedia.org/wiki/Colossus_computer

Intel – Colossus (1944)

“speed was the essence”

- Dorothy Du Boisson

“I was instructed to destroy all the records, which I did. I took all the drawings and the plans and all the information about Colossus on paper and put it in the boiler fire. And saw it burn.” - Tommy Flowers

• 2400 vacuum tubes
• ??? lbs
• ??? kWatts
• 5.8 MHz clock
• ??? MTTF

A call to confront faults scientifically

John von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components”, from a lecture delivered at the California Institute of Technology, January 1952

Rapid pace of innovation

385 instructions per second

http://en.wikipedia.org/wiki/ENIAC

1945 – ENIAC becomes first general purpose electronic computer
1947 – Bardeen, Brattain and Shockley develop transistor at Bell Labs
1947 – Richard Hamming develops codes for error correction and detection
1954 – IBM 608 becomes first commercial all-transistor calculator


Fault-tolerance is about redundancy
• In space…
  – Hardware
    • Prediction and migration
    • Detection and standby sparing
    • Modular redundancy
  – Software
    • Redundant copies of code
    • Redundant versions of code
• In time…
  – Coding techniques
    • Error correcting codes (data)
    • Residue codes (arithmetic)
  – Recovery blocks
    • Checkpoint and rollback
    • Transactional programming
    • Fault containment domains
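Two of the techniques in this list are small enough to sketch directly: modular redundancy (here triple modular redundancy with a majority vote) and a residue code used to check arithmetic. Both functions, their names, and the choice of modulus 3 are illustrative assumptions rather than anything specified on the slide.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority vote over three redundant results (triple modular redundancy).

    Masks one faulty replica; raises if all three replicas disagree.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica is faulty")
    return value


def residue_check_add(x, y, computed_sum, modulus=3):
    """Check an addition using residues.

    The residue of the result must equal the sum of the operand residues,
    all taken modulo `modulus`. Returns True if the check passes.
    """
    expected = (x % modulus + y % modulus) % modulus
    return computed_sum % modulus == expected
```

The vote costs three copies of the hardware but masks a fault outright, while the residue check is far cheaper but only detects errors, and misses any error that is a multiple of the modulus (for example, 45 passes the mod-3 check for 12 + 30).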

Hardware / software reliability

Fault → (Activation) → Error → (Propagation) → Failure

• Underlying system state is erroneous
• Delivered service deviates from specified service
• System state observed to be erroneous

• Fault Latency
• Error Latency

Error latency cannot be measured on a real system

Tandem NonStop VLX (1986)
• 2-16 procs
• 256 MBytes
• 12 MHz
• 240,000 hours mean time to system failure

Non-Stop II System (1981)

“Unlike the situation with hardware components, it is possible to develop perfect, defect-free, failure proof software. It is only a matter of cost to the manufacturer and inconvenience to the customer who must wait much longer for some needed software to be delivered.”

- Bartlett, et al., Fault Tolerance in Tandem Computer Systems, 1990.



Fail-Fast Operation = fault-intolerant


2.6 peta instructions per hard stop

(>1,000,000x in 35 years)


ASCI Q (2002)
• ES-45 cluster
• 4096 cores
• 10 Tflops
• 6.5 hours MTBF

http://ya-ru.ru/10-samyx-bystryx-kompyuterov-v-mire

23 Pflops per hard stop

(<10x in 17 years)
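The "per hard stop" figures quoted for UNIVAC I, the Tandem NonStop VLX, and ASCI Q can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated on the slides; the dictionary layout and function names are assumptions made for illustration, and the implied rate is simply work divided by mean time to failure, not a measured throughput.

```python
SECONDS_PER_HOUR = 3600.0

# (work completed per hard stop, mean time to failure in hours), as quoted on the slides
SYSTEMS = {
    "UNIVAC I (1951)": (470e6, 66.0),
    "Tandem NonStop VLX (1986)": (2.6e15, 240_000.0),
    "ASCI Q (2002)": (23e15, 6.5),
}


def implied_rate(work_per_stop, mttf_hours):
    """Average operations per second implied by the work done between hard stops."""
    return work_per_stop / (mttf_hours * SECONDS_PER_HOUR)


def improvement(newer, older):
    """Ratio of work per hard stop between two systems in SYSTEMS."""
    return SYSTEMS[newer][0] / SYSTEMS[older][0]


if __name__ == "__main__":
    for name, (work, mttf) in SYSTEMS.items():
        print(f"{name}: {implied_rate(work, mttf):.3g} operations/s implied")
    print(f"UNIVAC I to Tandem: {improvement('Tandem NonStop VLX (1986)', 'UNIVAC I (1951)'):.3g}x")
    print(f"Tandem to ASCI Q: {improvement('ASCI Q (2002)', 'Tandem NonStop VLX (1986)'):.3g}x")
```

The ratios reproduce the contrast the slides draw: work per hard stop grew by more than a factor of a million from UNIVAC I to the Tandem system, but by less than a factor of ten from Tandem to ASCI Q, because the gain in speed was almost cancelled by the drop in time between failures.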



Operated by the Los Alamos National Security, LLC for the DOE/NNSA LA-UR-07-4292/5853/6490


Defining Solve Efficiency in Terms of How the System is Spending its Time*

$$\text{Solve Efficiency} = \frac{t_s}{t_r} \cdot \frac{t_r}{t_{op}} = \frac{t_s}{t_{op}}$$

Functional Status Key
• Fully-Functional: $t$
• Partially-Functional: $t'$
• Non-Functional: $t$

Application States
• Operation Time: $t_{op}$
• Integration Time: $t_{int}$
• Total Time: $t_{tot}$
• Defensive IO Time: $t_{dio}$ / $t'_{dio}$
• Restart Time: $t_{rst}$ / $t'_{rst}$
• Rework Time: $t_{rwk}$ / $t'_{rwk}$
• Compute Time: $t_{cmp}$ / $t'_{cmp}$
• Productive IO Time: $t_{pio}$ / $t'_{pio}$
• Solve Time: $t_s$ / $t'_s$
• Fault Tolerant Time: $t_f$ / $t'_f$
• Archival Storage Time: $t_a$ / $t'_a$
• Production Time: $t_{pr}$ / $t'_{pr}$
• Unscheduled Downtime: $t_{usch}$
• Scheduled Downtime: $t_{sch}$
• External Failure Time: $t_{ef}$
• Internal Failure Time: $t_{if}$
• External Unusable Time: $t_{eu}$
• Internal Unusable Time: $t_{iu}$
• Reserved Time: $t_{rsv}$ / $t'_{rsv}$
• Idle Time: $t_{idle}$ / $t'_{idle}$
• Run Time: $t_r$ / $t'_r$

Inspired by Jon Stearley (based on SEMI-E10).
J. Stearley. Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS). In Proceedings of the Linux Clusters Institute Conference, 2005. See http://www.cs.sandia.gov/~jrstear/ras.

* Proposed in collaboration with S. Michalak (LANL) and L. Davey (LANL)
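As a rough illustration of the Solve Efficiency ratio defined on this slide, the sketch below computes it as the product of a solve fraction (solve time over run time) and a run fraction (run time over operation time). The function name and the assumption that solve time is contained in run time, which is contained in operation time, are mine; the slide's full state hierarchy is not reproduced here.

```python
def solve_efficiency(t_s, t_r, t_op):
    """Solve efficiency as (t_s / t_r) * (t_r / t_op), which reduces to t_s / t_op.

    t_s  : solve time, time spent advancing the solution
    t_r  : run time, time the application was running
    t_op : operation time, time the system was in operation
    Assumes 0 <= t_s <= t_r <= t_op with t_r > 0 (a nesting assumed for this sketch).
    """
    if not (0 <= t_s <= t_r <= t_op) or t_r <= 0:
        raise ValueError("expected 0 <= t_s <= t_r <= t_op and t_r > 0")
    solve_fraction = t_s / t_r   # how much of the run produced useful work
    run_fraction = t_r / t_op    # how much of operation time was spent running
    return solve_fraction * run_fraction
```

Keeping the two factors separate makes it visible whether efficiency is being lost inside the application (defensive I/O, restart, rework) or outside it (the system being up but not running the job).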

What are we trying to measure?

http://ascii.jp/elem/000/00/982/982434/index-4.html

Red Storm (2005)



Operated by the Los Alamos National Security, LLC for the DOE/NNSA LA-UR-07-4292/5853/6490

Slide 3

Operations Rate Only Tells Part of the Story: Red Storm From The Application’s Perspective

5000 Node Job Daily Availability, 7-Day Average MTTI, and Efficiency

(Cumulative Availability = 60% and Cumulative Efficiency = 63%)

[Figure: daily values from 01/14/06 to 02/04/06. Vertical axis: Number of Interrupts or Time in Hours (0 to 24). Series: Production Availability, System Interrupts, Application Interrupts, Runtime Efficiency, MTTI (System Only), MTTI (Sys + App).]

• 10,000 cores
• 36 Tflops
• < 10 hours MTBF (early adopter)

Ah, Checkpoint Restart

Checkpointing Efficiency and the Optimum Checkpoint Interval as Functions of the Dump Time, System MTBI, and Restart Overhead

[Figure: $T_{solve} / T_{wall}$ (0.00 to 1.00) against $t_c / M$ (0.01 to 100, logarithmic). Other labels on the plot: $2 t_d / M$ and $R / M$.]

$$t_c \approx \sqrt{2 t_d M}$$

$$t_c \approx M + t_d$$
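To make the trade-off on this slide concrete, here is a minimal sketch that evaluates checkpointing efficiency as a function of the checkpoint period and compares a brute-force optimum with the square-root rule of thumb. It assumes exponentially distributed interrupts and treats $t_c$ as the full period (compute segment plus dump); that model, the function names, and the example numbers are assumptions for illustration, since the slide does not state the model behind its plot.

```python
import math


def efficiency(t_c, t_d, mtbi, restart):
    """Estimated Tsolve / Twall for checkpoint period t_c (compute segment plus dump).

    Assumes exponentially distributed interrupts with mean `mtbi`, a dump time
    `t_d` paid once per period, and a restart overhead `restart` per interrupt.
    """
    if t_c <= t_d:
        return 0.0  # the whole period is spent dumping, so no work is solved
    expected_wall = mtbi * math.exp(restart / mtbi) * math.expm1(t_c / mtbi)
    return (t_c - t_d) / expected_wall


def first_order_period(t_d, mtbi):
    """Square-root rule of thumb for the checkpoint period."""
    return math.sqrt(2.0 * t_d * mtbi)


def best_period(t_d, mtbi, restart, steps=20000):
    """Brute-force search for the most efficient period in (t_d, mtbi + t_d]."""
    upper = mtbi + t_d
    candidates = (t_d + (upper - t_d) * k / steps for k in range(1, steps + 1))
    return max(candidates, key=lambda t: efficiency(t, t_d, mtbi, restart))


if __name__ == "__main__":
    # Hypothetical system: 5 minute dump, 24 hour MTBI, 10 minute restart (hours).
    t_d, mtbi, restart = 5.0 / 60.0, 24.0, 10.0 / 60.0
    guess = first_order_period(t_d, mtbi)
    best = best_period(t_d, mtbi, restart)
    print(f"rule of thumb: {guess:.3f} h, efficiency {efficiency(guess, t_d, mtbi, restart):.4f}")
    print(f"grid search:   {best:.3f} h, efficiency {efficiency(best, t_d, mtbi, restart):.4f}")
```

Under this model the efficiency curve is flat near its peak, so the rule of thumb lands close to the searched optimum whenever the dump time is small compared with the MTBI; the gap widens as the dump time approaches the MTBI.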

Silent Data Corruption

CC BY-NC 4.0

Karl-Heinz Winkler (2006)

“Resilience = keep the application going at scale, despite component failures”

Karl-Heinz Winkler (June, 2007)

National HPC Workshop (2009)

Some more reliability data*

*DeBardeleben, Laros, Daly, Scott, Engelmann, Harrod. High-End Computing Resilience. http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf






Why are fault rates rising?
• Number of components is going up, which will increase hard and soft faults
• Smaller circuit sizes, running at lower voltages to reduce power, increase the impact of thermal noise and radiation-induced faults
• Power management cycling significantly decreases component lifetimes due to thermal and mechanical stresses
• Resistance to adding additional detection and recovery logic on the chip because of additional power consumption and chip costs
• Heterogeneous systems make error detection and recovery even harder
• Increasing system and algorithm complexity makes faulty interaction of components more likely

Thanks to Al Geist (ORNL) and Sudip Dosanjh (LBNL)

Fault Classification
• Type
  – Permanent/Hard – continuous and stable events on the system
  – Intermittent/Soft – occasional events, cause intrinsic to system
  – Transient/Soft – occasional events, cause extrinsic to system
• Extent
  – Single-event – independent events that alter only a single component of system hardware or software state
  – Multi-event/common cause – correlated events that alter more than one component of system state

Defining resilience
• “The persistence of service delivery that can justifiably be trusted, when facing changes.” (Laprie, 2008)
• “The persistence of performability when facing changes.” (Meyer, 2009)
• “The ability of a system to keep applications running and maintain an acceptable level of service in the face of transient, intermittent, and permanent faults.” (HEC Resilience Report, 2009)

Nailing down resilience
• Data & Information Collection
• Anomaly detection
• Visualization
• Statistical Analysis
• Machine Learning
• Efficiency Modeling & Uncertainty Quantification
• Metrics & Measurement
• Simulation & Emulation
• Formal Methods
• Statistics & Optimal Control
• Soft Errors
• Silent Data Corruption
• Fault-tolerant Design
• Fault Injection
• Forward Migration & Verification
• Degraded Modes
• Platform & Application Monitoring
• Application & Platform Knobs
• Tunable Fidelity & Quality of Service
• RAS Theory & Performability
• Response & Recovery
• Next-generation Architectures
• Programming Models
• System Software & Middleware
• RAS Systems
• Tools
• Standards & Standard Framework

Resilience is a cross-domain challenge!

Fault-Tolerance Workshop (2009)

Resilience Layer

The architecture of a resilience feedback-control infrastructure

User-Centric Requirements

Job Input Parameters

Job Control and Resource Allocation + Application Configuration

System State

Application and System Monitoring

Performability Model

Resilience is a cross-stack challenge!


When HPC gives you lemons…

Opportunities for Innovation

Fault Characterization

Algorithm Based Fault Tolerance

Fault Analysis Tools

Fault Prediction and Detection

Fault-Tolerant System Software

Fault Aware Programming Models

RESILIENCE

Sorting as iterative optimization

Sloan, Kesler, Kumar and Rahimi, “A Numerical Optimization-based Methodology for Application Robustification”, Dependable Systems and Networks (DSN), 2010.

Iterative asynchronous algorithms

Charr, J. and Couturier, R. and Laiymani, D., “JACEP2P-V2: A Fully Decentralized and Fault Tolerant Environment for Executing Parallel Iterative Asynchronous Applications on Volatile Distributed Architectures,” FGCS, 2011, pp. 606-613.

Bahi, J. and Couturier, R. and Vuillermin, P., “Asynchronous iterative algorithms for computational science on the grid: three case studies,” VECPAR, 2004, pp. 302-314.

Developing fault-tolerant solvers

Hoemmen, M. and Heroux M., “Fault-Tolerant Iterative Methods via Selective Reliability,” Tech. Rep. SAND2011-3915 C, Sandia National Laboratories, 2011.


Can we use approaches like this for discrete mathematics?

Probabilistic computing: a starting point?

George, J., “Harnessing Resilience: Biased Voltage Overscaling for Probabilistic Signal Processing,” Doctoral Dissertation, 2011.


A case for recovery-driven design

- Sartori, J. and Sloan, J. and Kumar, R. “Stochastic Computing: Embracing Errors in Architecture and Design of Processors and Applications,” CASES, 2011.


Counting the costs
• How much am I willing to pay for reliability?
• How much am I already paying?
• What am I giving up?
  – Power?
  – Performance?
  – Other?
• Can I give up reliability and get something useful back in exchange?

Resilience Tradeoffs

Thanks to John Shalf, Lawrence Berkeley National Laboratory

TMR

Pairing

Checksum Arrays

FT-HPL

ABFT

Resilient Math Formulation
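The checksum-array flavor of ABFT named among these options can be sketched for matrix multiplication: a checksum row is appended to one operand and a checksum column to the other, so the product carries its own row and column sums and a single corrupted element can be located and repaired. This is a minimal illustration using integer matrices and hypothetical function names; real floating-point codes need tolerance-based comparisons and are considerably more involved.

```python
def with_column_checksums(a):
    """Append a row holding the sum of each column of a."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksums(b):
    """Append to each row of b the sum of that row."""
    return [list(row) + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def locate_error(c_full):
    """Return (row, col) of a single corrupted data element, or None if all checks pass."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise RuntimeError("error pattern is not a single corrupted data element")


def correct_error(c_full, position):
    """Rebuild the corrupted element from its row checksum, in place."""
    i, j = position
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others
```

A typical use is `c_full = matmul(with_column_checksums(a), with_row_checksums(b))`, followed by `locate_error(c_full)` after the computation. The cost is one extra row and column of arithmetic, which is the kind of exchange between overhead and reliability that the tradeoff discussion is about.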

Conclusion

Resilience is a call to innovation in HPC
