Failure, Resilience, Opportunity and Innovation
John Daly, U.S. Department of Defense Salishan High Speed Computing Conference
April 27, 2015
Outline
• The call to innovation
• A brief history of computer reliability
• Resilience comes of age
• Opportunities for the future
http://en.wikipedia.org/wiki/Tandem_Computers
Resilience at Scale
“Diversity jolts us into cognitive action in ways that homogeneity simply does not…. For this reason, diversity appears to lead to higher-quality scientific research.” - Scientific American, Volume 311, Issue 4, 2014.
“If you always do what you always did, you will always get what you always got.” - Albert Einstein
Innovation and Failure
https://www.youtube.com/watch?v=iJAq6drKKzE
Scientific – Bell Labs Relay Calculator
http://www.computerhistory.org/revolution/birth-of-the-computer/4/85/342
“Starting with the Model III delivered to the Armed Forces in 1944, not one of our customers has reported their computers giving out a wrong answer as the result of a machine error.”
- Second Symposium on Large Scale Digital Calculating Machinery, 1949.
Bi-quinary notation
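Bi-quinary notation stores each decimal digit in two groups of bits, with exactly one bit set in each group, so any single flipped bit produces an invalid pattern that can be caught. The following is a minimal sketch of that idea, assuming a 2 + 5 bit layout; the function names and bit ordering are illustrative and are not taken from the talk or from the relay calculators themselves.

```python
def encode_digit(d):
    """Encode a decimal digit as (bi, quinary) groups of 0/1 flags."""
    if not 0 <= d <= 9:
        raise ValueError("digit must be in 0..9")
    bi = [0, 0]
    quinary = [0, 0, 0, 0, 0]
    bi[d // 5] = 1       # which half of the range: 0-4 or 5-9
    quinary[d % 5] = 1   # position within that half
    return tuple(bi), tuple(quinary)


def is_valid(code):
    """A code word is valid only if each group has exactly one bit set."""
    bi, quinary = code
    return sum(bi) == 1 and sum(quinary) == 1


def decode_digit(code):
    """Decode a code word, refusing any pattern that fails the check."""
    if not is_valid(code):
        raise ValueError("invalid bi-quinary code: machine error detected")
    bi, quinary = code
    return 5 * bi.index(1) + quinary.index(1)


def flip_bit(code, position):
    """Flip one of the 7 bits (0-1: bi group, 2-6: quinary group)."""
    bits = list(code[0]) + list(code[1])
    bits[position] ^= 1
    return tuple(bits[:2]), tuple(bits[2:])
```

Flipping any one of the seven bits leaves a group with zero or two bits set, so `is_valid` rejects it. The cost is seven bits for a digit that needs only four, which is redundancy spent on detection.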
Business – UNIVAC I (1951)
• 5200 vacuum tubes
• 29,000 lbs
• 125 kW
• 2.25 MHz clock
• 66 hours mean time to system failure
470 million instructions per hard stop
http://commons.wikimedia.org/wiki/File:UNIVAC-I-PRL61-0977.jpg
http://en.wikipedia.org/wiki/Colossus_computer
Intelligence – Colossus (1944)
“speed was the essence”
- Dorothy Du Boisson
“I was instructed to destroy all the records, which I did. I took all the drawings and the plans and all the information about Colossus on paper and put it in the boiler fire. And saw it burn.” - Tommy Flowers
• 2400 vacuum tubes
• ??? lbs
• ??? kW
• 5.8 MHz clock
• ??? MTTF
A call to confront faults scientifically
John von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components”, from a lecture delivered at the California Institute of Technology, January 1952
Rapid pace of innovation
385 instructions per second
http://en.wikipedia.org/wiki/ENIAC
1945 – ENIAC becomes first general purpose electronic computer
1947 – Bardeen, Brattain and Shockley develop transistor at Bell Labs
1947 – Richard Hamming develops codes for error correction and detection
1954 – IBM 608 becomes first commercial all-transistor calculator
Fault-tolerance is about redundancy
• In space…
  – Hardware
    • Prediction and migration
    • Detection and standby sparing
    • Modular redundancy
  – Software
    • Redundant copies of code
    • Redundant versions of code
• In time…
  – Coding techniques
    • Error correcting codes (data)
    • Residue codes (arithmetic)
  – Recovery blocks
    • Checkpoint and rollback
    • Transactional programming
    • Fault containment domains
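Of the techniques listed, residue codes are perhaps the least familiar: each value carries its remainder modulo a small check base, and because remainders are preserved by addition and multiplication, the result of an arithmetic unit can be cross-checked cheaply. The sketch below assumes a check modulus of 3 and uses hypothetical helper names; it illustrates the principle and does not model any particular machine.

```python
MODULUS = 3  # assumed check base; 2**k % 3 is never 0, so any single bit flip is caught


def encode(value):
    """Pair a non-negative integer with its residue."""
    return value, value % MODULUS


def checked_add(x, y, adder=lambda a, b: a + b):
    """Add two encoded values and verify the sum against the residues."""
    (a, ra), (b, rb) = x, y
    result = adder(a, b)                 # the possibly faulty arithmetic unit
    expected = (ra + rb) % MODULUS       # independent, much cheaper residue path
    if result % MODULUS != expected:
        raise ArithmeticError("residue check failed: arithmetic fault detected")
    return result, expected


def checked_mul(x, y, multiplier=lambda a, b: a * b):
    """Multiply two encoded values and verify the product against the residues."""
    (a, ra), (b, rb) = x, y
    result = multiplier(a, b)
    expected = (ra * rb) % MODULUS
    if result % MODULUS != expected:
        raise ArithmeticError("residue check failed: arithmetic fault detected")
    return result, expected
```

Passing a deliberately broken `adder`, for example one that flips a bit of the sum, makes `checked_add` raise instead of returning a wrong answer. This is detection only; recovery would need one of the other mechanisms on the list, such as rollback.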
Hardware / software reliability
[Diagram: Fault → (activation) → Error → (propagation) → Failure, annotated with fault latency and error latency]
• Underlying system state is erroneous
• Delivered service deviates from specified service
• System state observed to be erroneous
Error latency cannot be measured on a real system
Tandem NonStop VLX (1986)
• 2-16 procs
• 256 MBytes
• 12 MHz
• 240,000 hours mean time to system failure
Non-Stop II System (1981)
“Unlike the situation with hardware components, it is possible to develop perfect, defect-free, failure proof software. It is only a matter of cost to the manufacturer and inconvenience to the customer who must wait much longer for some needed software to be delivered.”
- Bartlett et al., Fault Tolerance in Tandem Computer Systems, 1990.
http://en.wikipedia.org/wiki/Tandem_Computers
Fail-Fast Operation = fault-intolerant
2.6 peta instructions per hard stop
(>1,000,000x in 35 years)
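The "per hard stop" figures combine speed and reliability into one number: work completed between failures is roughly the execution rate multiplied by the mean time to failure. The short sketch below reproduces the comparison from the slide figures alone. The implied instruction rates are back-computed here for illustration and are not stated in the talk.

```python
SECONDS_PER_HOUR = 3600


def implied_rate(work_per_hard_stop, mttf_hours):
    """Average work per second implied by work-per-stop and MTTF."""
    return work_per_hard_stop / (mttf_hours * SECONDS_PER_HOUR)


# Figures quoted on the slides.
univac_work, univac_mttf_hours = 470e6, 66
tandem_work, tandem_mttf_hours = 2.6e15, 240_000

improvement = tandem_work / univac_work
univac_rate = implied_rate(univac_work, univac_mttf_hours)
tandem_rate = implied_rate(tandem_work, tandem_mttf_hours)

print(f"improvement in work per hard stop: {improvement:.2e}x")
print(f"implied UNIVAC I rate: {univac_rate:.0f} instructions/s")
print(f"implied NonStop VLX rate: {tandem_rate:.2e} instructions/s")
```

The ratio comes out at several million, consistent with the slide's ">1,000,000x", and most of that gain comes from the longer time between failures rather than from raw speed.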
ASCI Q (2002)
• ES-45 cluster
• 4096 cores
• 10 Tflops
• 6.5 hours MTBF
http://ya-ru.ru/10-samyx-bystryx-kompyuterov-v-mire
23 Pflops per hard stop
(<10x in 17 years)
UNCLASSIFIED
Operated by the Los Alamos National Security, LLC for the DOE/NNSA LA-UR-07-4292/5853/6490
Defining Solve Efficiency in Terms of How the System is Spending its Time*

$$\text{Solve Efficiency} = \frac{t_s}{t_r} \cdot \frac{t_r}{t_{op}} = \frac{t_s}{t_{op}}$$

Functional Status Key: Fully-Functional (t), Partially-Functional (t!), Non-Functional (t)

• Operation Time: t_op
• Integration Time: t_int
• Total Time: t_tot
• Defensive IO Time: t_dio / t!_dio
• Restart Time: t_rst / t!_rst
• Rework Time: t_rwk / t!_rwk
• Compute Time: t_cmp / t!_cmp
• Productive IO Time: t_pio / t!_pio
• Solve Time: t_s / t!_s
• Fault Tolerant Time: t_f / t!_f
• Archival Storage Time: t_a / t!_a
• Production Time: t_pr / t!_pr
• Unscheduled Downtime: t_usch
• Scheduled Downtime: t_sch
• External Failure Time: t_ef
• Internal Failure Time: t_if
• External Unusable Time: t_eu
• Internal Unusable Time: t_iu
• Reserved Time: t_rsv / t!_rsv
• Idle Time: t_idle / t!_idle
• Run Time: t_r / t!_r
Application States

Inspired by Jon Stearley (based on SEMI-E10).
J. Stearley. Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS). In Proceedings of the Linux Clusters Institute Conference, 2005. See http://www.cs.sandia.gov/~jrstear/ras.
* Proposed in collaboration with S. Michalak (LANL) and L. Davey (LANL)
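The slide defines solve efficiency as the fraction of run time spent solving, multiplied by the fraction of operation time spent running, which reduces to solve time over operation time. A minimal sketch of that bookkeeping follows; the function name and the sample hours are hypothetical and chosen only to show the arithmetic.

```python
def solve_efficiency(t_s, t_r, t_op):
    """Solve efficiency = (t_s / t_r) * (t_r / t_op) = t_s / t_op.

    t_s  : solve time (useful forward progress of the application)
    t_r  : run time (time the application is actually running)
    t_op : operation time (time the system is in operation)
    """
    if not 0 <= t_s <= t_r <= t_op or t_r == 0:
        raise ValueError("expected 0 <= t_s <= t_r <= t_op with t_r > 0")
    fraction_of_run_spent_solving = t_s / t_r
    fraction_of_operation_spent_running = t_r / t_op
    return fraction_of_run_spent_solving * fraction_of_operation_spent_running


# Hypothetical week: 168 h in operation, 120 h running jobs, 90 h of real progress.
weekly = solve_efficiency(t_s=90.0, t_r=120.0, t_op=168.0)
print(f"solve efficiency: {weekly:.1%}")
```

Keeping the two factors separate is the useful part: a low first factor points at checkpoint, restart and rework overhead inside the application, while a low second factor points at downtime and scheduling.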
What are we trying to measure?
http://ascii.jp/elem/000/00/982/982434/index-4.html
Red Storm (2005)
Operations Rate Only Tells Part of the Story: Red Storm From the Application's Perspective
5000 Node Job Daily Availability, 7-Day Average MTTI, and Efficiency
(Cumulative Availability = 60% and Cumulative Efficiency = 63%)
[Chart: daily values from 01/14/06 to 02/04/06. Vertical axis: Number of Interrupts or Time in Hours (0 to 24). Series: Production Availability, System Interrupts, Application Interrupts, Runtime Efficiency, MTTI (System Only), MTTI (Sys + App).]
• 10,000 cores
• 36 Tflops
• < 10 hours MTBF (early adopter)
Ah, Checkpoint Restart
Checkpointing Efficiency and the Optimum Checkpoint Interval as Functions of the Dump Time, System MTBI, and Restart Overhead
[Chart: labels recovered from the plot are T_solve / T_wall, t_c / M, 2 t_d / M and R / M, on logarithmic axes. Annotations relate t_c to 2 t_d M and to M + t_d.]
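The trade-off behind this chart can be sketched numerically: checkpoint too often and time is lost to dumps, too rarely and time is lost to rework after an interrupt. The code below is a sketch under stated assumptions, not the exact model behind the slide. It assumes exponentially distributed interrupts with mean M, dump time t_d, restart overhead R, and takes t_c to be the compute time between dumps. It uses the well-known first-order estimate sqrt(2 t_d M) as a starting point and a brute-force scan as a cross-check.

```python
import math


def first_order_interval(t_d, M):
    """First-order estimate of the optimum compute interval between dumps."""
    return math.sqrt(2.0 * t_d * M)


def efficiency(t_c, t_d, M, R=0.0):
    """T_solve / T_wall under the assumed exponential-interrupt model."""
    wall_per_unit_solve = (
        M * math.exp(R / M) * (math.exp((t_c + t_d) / M) - 1.0) / t_c
    )
    return 1.0 / wall_per_unit_solve


def best_interval_by_scan(t_d, M, R=0.0, steps=10_000):
    """Scan compute intervals in (0, 3M] and return the most efficient one."""
    candidates = [3.0 * M * (i + 1) / steps for i in range(steps)]
    return max(candidates, key=lambda t_c: efficiency(t_c, t_d, M, R))


# Hypothetical system: 1 h dumps, 50 h mean time between interrupts.
t_d, M = 1.0, 50.0
estimate = first_order_interval(t_d, M)
scanned = best_interval_by_scan(t_d, M)
print(f"first-order estimate: {estimate:.2f} h, scanned optimum: {scanned:.2f} h")
print(f"efficiency at estimate: {efficiency(estimate, t_d, M):.3f}")
```

In this model the restart overhead R scales the efficiency but does not move the optimum interval, and the efficiency curve is fairly flat near its peak, so a rough estimate of M is usually good enough to choose a sensible interval.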
Silent Data Corruption
CC BY-NC 4.0
Karl-Heinz Winkler (2006)
“Resilience = keep the application going at scale, despite component failures”
Karl-Heinz Winkler (June, 2007)
National HPC Workshop (2009)
Some more reliability data*
*DeBardeleben, Laros, Daly, Scott, Engelmann, Harrod. High-End Computing Resilience. http://www.csm.ornl.gov/~engelman/publications/debardeleben09high-end.pdf
Why are fault rates rising?
• Number of components is going up, which will increase hard and soft faults
• Smaller circuit sizes, running at lower voltages to reduce power, increase the impact of thermal noise and radiation-induced faults
• Power management cycling significantly decreases component lifetimes due to thermal and mechanical stresses
• Resistance to adding detection and recovery logic on the chip because of additional power consumption and chip costs
• Heterogeneous systems make error detection and recovery even harder
• Increasing system and algorithm complexity makes faulty interaction of components more likely
Thanks to Al Geist (ORNL) and Sudip Dosanjh (LBNL)
Fault Classification
• Type
  – Permanent/Hard – continuous and stable events on the system
  – Intermittent/Soft – occasional events, cause intrinsic to system
  – Transient/Soft – occasional events, cause extrinsic to system
• Extent
  – Single-event – independent events that alter only a single component of system hardware or software state
  – Multi-event/common cause – correlated events that alter more than one component of system state
Defining resilience
• “The persistence of service delivery that can justifiably be trusted, when facing changes.” (Laprie, 2008)
• “The persistence of performability when facing changes.” (Meyer, 2009)
• “The ability of a system to keep applications running and maintain an acceptable level of service in the face of transient, intermittent, and permanent faults.” (HEC Resilience Report, 2009)
Nailing down resilience
Resilience is a cross-domain challenge!
• Data & Information Collection
• Anomaly detection
• Visualization
• Statistical Analysis
• Machine Learning
• Efficiency Modeling & Uncertainty Quantification
• Metrics & Measurement
• Simulation & Emulation
• Formal Methods
• Statistics & Optimal Control
• Soft Errors
• Silent Data Corruption
• Fault-tolerant Design
• Fault Injection
• Forward Migration & Verification
• Degraded Modes
• Platform & Application Monitoring
• Application & Platform Knobs
• Tunable Fidelity & Quality of Service
• RAS Theory & Performability
• Response & Recovery
• Next-generation Architectures
• Programming Models
• System Software & Middleware
• RAS Systems
• Tools
• Standards & Standard Framework
Fault-Tolerance Workshop (2009)
Resilience Layer
The architecture of a resilience feedback-control infrastructure
User-Centric Requirements
Job Input Parameters
Job Control and Resource Allocation + Application Configuration
System State
Application and System Monitoring
Performability Model
Resilience is a cross-stack challenge!
When HPC gives you lemons…
Opportunities for Innovation
Fault Characterization
Algorithm Based Fault Tolerance
Fault Analysis Tools
Fault Prediction and Detection
Fault-Tolerant System Software
Fault Aware Programming Models
RESILIENCE
Sorting as iterative optimization
Sloan, Kesler, Kumar and Rahimi, “A Numerical Optimization-based Methodology for Application Robustification”, Dependable Systems and Networks (DSN), 2010.
Iterative asynchronous algorithms
Charr, J. and Couturier, R. and Laiymani, D., “JACEP2P-V2: A Fully Decentralized and Fault Tolerant Environment for Executing Parallel Iterative Asynchronous Applications on Volatile Distributed Architectures,” FGCS, 2011, pp. 606-613.
Bahi, J. and Couturier, R. and Vuillermin, P., “Asynchronous iterative algorithms for computational science on the grid: three case studies,” VECPAR, 2004, pp. 302-314.
Developing fault-tolerant solvers
Hoemmen, M. and Heroux M., “Fault-Tolerant Iterative Methods via Selective Reliability,” Tech. Rep. SAND2011-3915 C, Sandia National Laboratories, 2011.
Can we use approaches like this for discrete mathematics?
Probabilistic computing: a starting point?
George, J., “Harnessing Resilience: Biased Voltage Overscaling for Probabilistic Signal Processing,” Doctoral Dissertation, 2011.
A case for recovery-driven design
- Sartori, J. and Sloan, J. and Kumar, R. “Stochastic Computing: Embracing Errors in Architecture and Design of Processors and Applications,” CASES, 2011.
Counting the costs
• How much am I willing to pay for reliability?
• How much am I already paying?
• What am I giving up?
  – Power?
  – Performance?
  – Other?
• Can I give up reliability and get something useful back in exchange?
Resilience Tradeoffs
Thanks to John Shalf, Lawrence Berkeley National Laboratory
TMR
Pairing
Checksum Arrays
FT-HPL
ABFT
Resilient Math Formulation
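To make the ABFT and checksum entries concrete, here is a small sketch of the classic checksum scheme for matrix multiplication: append column sums to A and row sums to B, multiply, and the product carries checksums that locate and repair a single corrupted element. It assumes integer matrices so that comparisons are exact, and the function names are illustrative. It is not taken from the figure and does not represent any of the specific implementations it compares.

```python
def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def checked_matmul(A, B, multiply=matmul):
    """Multiply with checksums; correct one bad element or raise."""
    n, m = len(A), len(B[0])
    A_c = A + [[sum(col) for col in zip(*A)]]   # extra row: column sums of A
    B_r = [row + [sum(row)] for row in B]       # extra column: row sums of B
    C = multiply(A_c, B_r)                      # (n+1) x (m+1) checksum product

    bad_rows = [i for i in range(n) if sum(C[i][:m]) != C[i][m]]
    bad_cols = [j for j in range(m)
                if sum(C[i][j] for i in range(n)) != C[n][j]]

    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One row and one column disagree: the fault sits at their crossing.
        i, j = bad_rows[0], bad_cols[0]
        others = sum(C[i][:m]) - C[i][j]
        C[i][j] = C[i][m] - others              # rebuild from the row checksum
    elif bad_rows or bad_cols:
        raise ArithmeticError("uncorrectable checksum mismatch")
    return [row[:m] for row in C[:n]]


def faulty_multiply(A, B):
    """Hypothetical faulty unit: corrupts one element of the product."""
    C = matmul(A, B)
    C[0][1] += 7
    return C
```

Calling `checked_matmul(A, B, multiply=faulty_multiply)` returns the same product as the fault-free path. The overhead is one extra row and one extra column of work, much cheaper than running the multiplication three times as TMR would.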
Conclusion
Resilience is a call to innovation in HPC