Upload
persistent-systems-ltd
View
37.952
Download
15
Tags:
Embed Size (px)
DESCRIPTION
Prof Kishor Trivedi, Duke University, USA talks about Software Faults, Failures and Their Mitigations at the 6th Session of Turing100@Persistent
Citation preview
Copyright © 2013 by K.S. Trivedi
Copyright © 2013 by K.S. Trivedi
Kishor TrivediDuke High Availability Assurance Lab (DHAAL)
Dept. of Electrical & Computer EngineeringDuke University
Durham, NC [email protected]
www.ee.duke.edu/~kst
Software Faults, Failures and Their Mitigations
Persistent SystemsJanuary 5, 2013
Copyright © 2013 by K.S. Trivedi2
Duke UniversityResearch Triangle Park (RTP)
UNC-CHUNC-CH
DukeDuke
NC state
NC state
USAUSA North Carolina
North Carolina
Copyright © 2013 by K.S. Trivedi
Duke University
• U.S. News & World Report in its 2012 edition, – ranked the university's undergraduate program
8th among all universities in USA
3
Copyright © 2013 by K.S. Trivedi
Duke University
4
NCAA Men’s Basketball Champions 2010 (also 1998,1999 & 2001)NCAA Men’s Basketball Champions 2010 (also 1998,1999 & 2001)
Copyright © 2013 by K.S. Trivedi
HARP (NASA), SAVE (IBM),IRAP (Boeing)SHARPE, SPNP, SREPT
TheoryTheory
ApplicationsApplications
Books: Blue, Red, White
Software Software PackagesPackages
Stochastic modeling methods & numerical solution methods: Large Fault trees, Stochastic Petri Nets,Large/stiff Markov & non-Markov modelsFluid stochastic modelsPerformability & Markov reward modelsSoftware aging and rejuvenationSecurity, Survivability, Resilience quantificationMachine Learning
Trivedi’s research triangle
5
Reliability/availability/performanceAvionics (Boeing), Space (NASA/JPL), Power systems (GE), Automobile systems (GM) Computer systems (EMC,SUN,HP,TCS)Telco systems (AT&T, Lucent, Avaya)Computer Networks (Motorola)Virtualized Data center (NEC)Cloud computing (IBM, NEC, Cisco)Software Aging & Rejuvenation (Huawei)
Copyright © 2013 by K.S. Trivedi
Textbooks
Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Second edition,
John Wiley, 2001 (Bluebook) [First edition by Prentice-Hall, 1982]
Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the
SHARPE Software Package, Kluwer (now Springer), 1996 (Redbook)
Queuing Networks and Markov Chains, John Wiley, second edition, 2006 (White book)
6
Copyright © 2013 by K.S. Trivedi
DHAAL & IndustryA Success Story
• Reliability Prediction of Boeing 787 Current Return Network for FAA Certification
• Security Quantification (DARPA SITAR)/NSF• Survivability Quantification for Lucent POTS and Siemens Smartgrid• Cloud performance, availability, power with IBM and Cisco• Reliability/Availability Prediction of SIP protocol on IBM WebSphere• Software Aging and Rejuvenation: state of the art, theory, measurements,
and implementation (IBM x-series); Huawei• Cloud computing security (Measurements, quantification, and
implementation) - NATO Science for Peace and Security• NASA-JPL Failures data analytics• NEC collaboration for Performability Management (VMs allocation, VMs
and VMM rejuvenation, etc) in Virtualized Data Center• Data analytics (statistical and machine learning techniques used to
interpret huge volume of data) - WiPro Technologies• Software Reliability/Availability/Performability analysis – Short courses,
seminars, and consulting - Tata consulting services
77
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation • A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
Pervasive Dependence on Computer SystemsNeed for High Reliability/Availability
Health & Medicine Avionics
Entertainment
Banking
Communication
Copyright © 2013 by K.S. Trivedi
Basic Definitions
• Steady-state availability (Ass) or just availability Long-term probability that the system is available when
requested:
MTTF is the system mean time to failure, a complex combination of component MTTFs
MTTR is the system mean time to recovery- may consist of many phases
MTTRMTTF
MTTF
ssA
Copyright © 2013 by K.S. Trivedi
Basic Definitions
• Downtime in minutes per year(un)availability is usually presented in terms of annual downtime.
– Downtime = 876060 (1- Ass) minutes.
– 5 NINES (Ass = 0.99999) 5.26 minutes annual downtime
Copyright © 2013 by K.S. Trivedi
Number of Nines– Reality Check
• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week
– Approx. 80 hours/year=4800 minutes/year
– Ass=(8760-80)/8760=0.9908
– That is, between 2 NINES and 3 NINES!
• This study assumes planned and unplanned downtime, together
Copyright © 2013 by K.S. Trivedi
Achieving High Availability is a Challenge
• Black Sept. 2011, In the same week!!!!:– Microsoft Cloud service outage (2.5 hours)– Google Docs service outage (1 hour)
• A memory leak due to a software update
• Sept. 2012 GoDaddy (4 hours)– 5 millions of websites affected
• Oct. 2012 Amazon – 10/15/2012 Webservices – 6 hours (Memory leak)– 10/27/2012 EC2 – > 2 hours
Copyright © 2013 by K.S. Trivedi
Downtown Costs per Hour
• Brokerage operations $6,450,000• Credit card authorization $2,600,000• eBay (1 outage 22 hours) $225,000• Amazon.com $180,000• Package shipping services $150,000• Home shopping channel $113,000• Catalog sales center $90,000• Airline reservation center $89,000• Cellular service activation $41,000• On-line network fees $25,000• ATM service fees $14,000
Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. ”...based on a survey done by Contingency Planning Research."
Copyright © 2013 by K.S. Trivedi
High Reliability/Availability
• Hardware fault tolerance, fault management, reliability/availability modeling/assurance relatively well developed
• System outages more due to software faults
• Key Challenge:
Software reliability is one of the weakest links in system
reliability/availability
Copyright © 2013 by K.S. Trivedi
Software is the problem
161985
2005
Across different industries….
Jim Gray’s paper titled “Why do computers stop and what can be done about it?”started to pointed out this trend in 1985, followed by his paper “A census of tandem system availability between 1985 and 1990”
Copyright © 2013 by K.S. Trivedi
Increasing SW Failure Rate?Planetary Missions Flight Software: A. Nikora of JPL
Mission Name (in launch order)
Mars Pathfinder CASSINI Mars Mars Stardust Mars Genesis Mars Deep MarsGlobal Climate Polar Odyssey Exploration Impact Reconnaissance
Surveyor Orbiter Lander Rover Orbiter
The interval between the first and last launch: 8.76 years.
The interval between successive launches ranges from:23 to 790 days.
Similar results for ground software
Copyright © 2013 by K.S. Trivedi
High Reliability/Availability: Software is the problem
• Fault avoidance – good software engineering practices – difficult for large/complex software systems
– Impossible to fully test and verify if software is fault-free“Testing shows the presence, not the absence, of bugs”
- E. W. Dijkstra
• Yet there are stringent requirements for failure-free operation
Copyright © 2013 by K.S. Trivedi
High Reliability/Availability: Software is the problem (2)
Software fault tolerance is a potential solution to improve software reliability in lieu of virtually impossible fault-free software
Copyright © 2013 by K.S. Trivedi
Software Fault Tolerance Classical Techniques
Design diversity– N-version programming– Recovery block
Expensive not used much in practice!
Yet there are stringent requirements for failure-free operation
Challenge: Affordable Software Fault Tolerance
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation• A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
High availability SIP Application Server Configuration on IBM WebSphere
Blade 4
IP Sprayer-IBM Load Balancer
-
SIP
IBM PC
Replication
Group 3
Blade 2
Blade 3
Blade 4
Blade 2
AS 1
AS 2
AS 3
AS 4
AS 5
AS 1
AS 4
AS 2
AS 5
AS 6
AS 3
AS 6
Blade 3
Replication Domain 1
Replication Domain 2
Replication Domain 3
SIP
Proxy 1
SIP Proxy 1
Blade 1
Blade 1
Replication Domain 4
Replication Domain 5
Replication Domain 6
Blade Chassis 1
Blade Chassis 2
Blade 4
Test Driver
Test Driver
Test drivers
DM
AS1 thru AS6 are Application Server
Proxy1's are Stateless Proxy Server
PRDC 2008 and ISSRE 2010 papers
Copyright © 2013 by K.S. Trivedi
High availability SIP Application Server configuration on WebSphere
Hardware configuration: – Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis
sufficient for performance)
Software configuration:– 2 copies of SIP/Proxy servers (1 sufficient for performance)
– 12 copies of WAS (6 sufficient for performance)
– Each WAS instance forms a redundancy pair (replication domain) with WAS installed on another node on a different chassis
• The system has hardware redundancy and software redundancy
Copyright © 2013 by K.S. Trivedi
High availability SIP Application Server configuration on WebSphere
Software Fault Tolerance– Identical copies of SIP proxy used as backups (hot spares)– Identical copies of WebSphere Applications Server (WAS) used
as backups (hot spares)– Type of software redundancy – (not design diversity) but
replication of identical software copies– Normal recovery
• restart software, reboot node or fail-over to a software replica; only when all else fails, a “software repair” is invoked
Copyright © 2013 by K.S. Trivedi
Escalated levels of Recovery A real example Avaya Servers and Media Gateways
Single Process Restart (SPR)
IF 3 SPR within
60 seconds
NO
System Warm Restart (SWR)
YES
IF 3 SWR
within 15 min
NO System Cold Restart (SCR)
IF 3 SCR within 15 min
NO
Avaya Communication
Manager Software reloads
(ACMSR)
YES
YES
IF 3 ACMSR within 15 min
OS RebootYES
NO
The flowchart depicts the actions taken for recovery after a failure is detected. Try the simplest recovery method first, then a more complex etc.
Copyright © 2013 by K.S. Trivedi
Software Fault Tolerance: New Thinking
Retry, restart, reboot!
– Known to help in dealing with hardware transients
– Do they help in dealing with failures caused by software bugs?
– If yes, why?
Copyright © 2013 by K.S. Trivedi
A Cartoon
Why is this true………at least for computers?
Copyright © 2013 by K.S. Trivedi
Software Fault Tolerance: New Thinking
Failover to an identical software replica (that is not a diverse version)
– Does it help?
– If yes, why?
Twenty years ago this would be considered crazy!
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation• A Real System• Software Fault Classification
– Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007
• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
Software Faults main threats to high reliability,
availability & safety
Copyright © 2012 by K.S. Trivedi
Copyright © 2013 by K.S. Trivedi31
IFIP Working Group 10.4 (Laprie)
• Failure occurs when the delivered service no longer complies with the desired output.
• Error is that part of the system state which is liable to lead to subsequent failure.
• Fault is adjudged or hypothesized cause of an error.
Faults are the cause of errors that may lead to failuresFault Error Failure
Copyright © 2013 by K.S. Trivedi
Need to Classify bug types
• We submit that a software fault tolerance approach based on retry, restart, reboot or fail-over to an identical software replica (not a diverse version) work because of a significant number of software failures are caused by Mandelbugs as opposed to the traditional software bugs now called Bohrbugs
Copyright © 2013 by K.S. Trivedi
Need to Classify bug types
• In recent years, researchers have reported the phenomenon of “software aging” (i.e., degraded performance and/or increased failure rate of long-running software systems).
• Puzzle: How can performance and failure rate change if the software code is not modified?!
Study software fault types and their relationships
Copyright © 2013 by K.S. Trivedi
Jim Gray’s Definitions
• The terms “Bohrbug” and “Heisenbug” were first used in print by Jim Gray in 1985.
• “Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring.”
• “Most production software faults are soft. If the program state is reinitializedand the failed operation is retried,
the operation will not fail a second time. … The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers.” (Gray, 1985)
J. Gray
Copyright © 2013 by K.S. Trivedi
Bruce Lindsay’s Definition
• “Heisenbugs as originally defined … are bugs in which clearly the system behavior is incorrect,
and when you try to look to see why it’s incorrect, the problem goes away.” (Lindsay, 2004)
• The term alludes to the physicist Werner Heisenberg and his Uncertainty Principle.
Based on Gray’s paper, researchers have often equated Heisenbugs with soft faults.
However, when Bruce Lindsay originally coined the term in the 1960s (while working with Jim Gray), he had a more narrow definition in mind.
B. Lindsay, photo by T. Upton
Copyright © 2013 by K.S. Trivedi
Heisenbug – Our Definition
• Heisenbug := A fault that stops causing a failure or that manifests differently when one attempts to probe or isolate it.
• How can probing affect the bug?1. Some debuggers initialize unused
memory to default values, thus preventing failures due to improper initialization.
2. Trying to investigate a failure can influence process scheduling in such a way that a scheduling-related failure does not occur again.
Copyright © 2013 by K.S. Trivedi
Example: A bug causing a failure
A Classification of Software Faults
• Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity.
whenever the user enters a negative date of birth Since they are easily found, Bohrbugs tend to be
detected and fixed during the software testing phase.
The term alludes to the physicist Niels Bohr and his rather simple atomic model.
Copyright © 2013 by K.S. Trivedi
Mandelbug – Definition
• Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by a it are not systematically reproducible.
• Example: A bug whose
activation is scheduling-dependent The residual faults in a thoroughly-tested
piece of software are mainly Mandelbugs. The term alludes to the mathematician Benoît
Mandelbrot and his research in fractal geometry.
Copyright © 2013 by K.S. Trivedi
41Mandelbugs: Consequences• Mandelbugs are difficult to detect and remove
during the software testing phase.• An operation that failed due to a Mandelbug may
execute correctly upon retry even if the fault has not been removed; changing the environment may suffice.
• Potential recovery techniques:– “Microreboot” of individual components– Application restart– System reboot– Failover to a standby component (replicate)– Manual recovery
Copyright © 2013 by K.S. Trivedi
Examples of Types of Bugs in IT System
• Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.
• The projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sector.
42
Copyright © 2013 by K.S. Trivedi
Aging-related Bug – Definition
• Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance.
Example: A bug causing memory leaks in the application
Note that the aging phenomenon requires a delay between fault activation and failure occurrence.
Note also that the software appears to age due to such a bug; there is no physical deterioration
Copyright © 2013 by K.S. Trivedi
Relationships
Bohrbug and Mandelbug are complementary antonyms.
Aging-related bugs are a subtype of Mandelbugs
Aging-Related Bugs
Bohrbugs
Mandelbugs
Aging
-
Related Bugs
Bohrbugs
Mandelbugs
Copyright © 2013 by K.S. Trivedi
Important Questions about these Bugs
• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs– How do these fractions vary
• over time• over projects, languages, application types,…
– Need Measurements – Current NASA/JPL Project with Allen Nikora & Michael Grottke; preliminary
results from one NASA software project: • 52% Bohrbugs• 35% Mandelbugs (non-aging-related)• 4% Aging-related bugs• 7% Operator related• 2% Unclassified
– Very similar results for Linux, MySQL, Apache AXIS, httpd• What are the methods of mitigation for the different fault types
Copyright © 2013 by K.S. Trivedi
Trends in SW Fault Type ProportionsPlanetary Missions Flight Software
• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions analyzed)
• Result: The proportion of Bohrbugs seems to settle at around the same value. Such a convergence to similar values is less obvious for the other fault types.
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation• A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
Environmental diversityA new thinking to deal with software faults and failures
Copyright © 2012 by K.S. Trivedi
Copyright © 2013 by K.S. Trivedi
Software Fault Tolerance: New Thinking
New thinking: Environmental Diversity as opposed to Design Diversity
Our claim is that this works since failures due to Mandelbugs are not negligible, we have an affordable software fault tolerance technique that we call
Environmental Diversity
Copyright © 2013 by K.S. Trivedi
What is environmental diversity?
• The underlying idea of Environmental diversity – Retry a previously faulty operation and it works– Why?– because of the environment where the operation was
executed has changed enough to avoid the fault activation.
• The environment is understood as– OS resources, other applications running concurrently and
sharing the same resources, interleaving of operations, concurrency, or synchronization.
Copyright © 2013 by K.S. Trivedi
What is environmental diversity?
• The execution of an application depends on the environment
Hardware
Operating System
Ap1 Ap2 Ap3 Ap4
Hardware
Operating System
Ap4 Ap6 Ap3 Ap5
Restart the application
Environment at time t1Environment at time t1+n
Copyright © 2013 by K.S. Trivedi
Environmental Diversity• Restart an application, reboot a node or failover to an
identical standby replica work because of the environmental diversity that will be underlying these actions;– By environment here we mean the resources of the OS, other
applications running concurrently and sharing system resources, interleaving of operations, concurrency, synchronization etc.
• Environmental Diversity uses time redundancy over expensive design diversity
• [Adams] Restart• [Jalote et al.] Rollback, rollforward • [Patriot] Occasional reboot, “switch off and on”• [Avaya Swift] restart process; failover to a replica• [IBM SIP] escalated levels: restart, reboot, failover…• [IBM Director-X-series] Rejuvenation
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation• A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
Methods of Mitigation
Copyright © 2012 by K.S. Trivedi
Copyright © 2013 by K.S. Trivedi
Mitigation
Copyright © 2013 by K.S. Trivedi
Bohrbugs: Remove
Find and fix the bugs during testing
Failure data collected during testing
Calibrate a software reliability growth model (SRGM) using failure data; this model is then used for prediction
Many SRGMs exist (JM,NHPP,HGRGM, etc.) Books by Lyu, Musa, Cai
Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals of Software Engineering, 1999
Measurements Empirical (statistical) models
Copyright © 2013 by K.S. Trivedi
Mitigation
Copyright © 2013 by K.S. Trivedi
OS Availability Model (IBM BladeCenter)
UP DN DWlOS
mOS
RPasp
bOSbOS
(1-bOS)bOSDT
dOS
Reboot (Failure due to a Mandelbug)
Fix (Failed due to a
Bohrbug)
Copyright © 2013 by K.S. Trivedi
Availability model of a Proxy or a WAS (IBM SIP on websphere)
UA UR UB(1-r)rm
rrmqra
(1-q)ra
bbm
RE(1-b)bm
m
UOUPg ed2
1D
ed2dd1
(1-e)d2
UN
dm
1N
(1-d)d1
(1-e)d2
2N (1-d)d1
(1-e)d2
ed2 dd1
Application server and proxy server
• Failure detection
– By WLM
– By Node Agent
– Manual detection
• Recovery
– Node Agent• Auto process restart
– Manual recovery• Process restart• Node reboot• Repair
Copyright © 2013 by K.S. Trivedi
Outline
• Motivation• A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions
Copyright © 2013 by K.S. Trivedi
Aging Related Bugs: Replicate, Restart, Reboot, Rejuvenate
Copyright © 2012 by K.S. Trivedi
Copyright © 2013 by K.S. Trivedi
Software Aging
Aging phenomenonError conditions accumulating over time
Main causes of Software AgingMemory leak, fragmentation, Unterminated threads, Data corruption, Round-
off errors, Unreleased file-locks, etc
Observed systemOS, Middle-ware, Netscape, Internet Explorer etc
Performance degradation, system failurePerformance degradation, system failure
Copyright © 2013 by K.S. Trivedi
Software Aging - Definition
“Software Aging” phenomenon
Long-running software tends to show an increasing failure rate.
Not related to application program becoming obsolete due to changing requirements/maintenance.
Software appears to age; no real deterioration
Copyright © 2013 by K.S. Trivedi
Software Aging - Examples
• Cisco Catalyst Switch [Matias Jr.]• File system aging [Smith & Seltzer]• Gradual service degradation in the AT&T transaction processing
system [Avritzer et al.]• Error accumulation in Patriot missile system’s software [Marshall] • Resources exhaustion in Apache [Li et al., Grottke et al.]• Physical memory degradation in a SOAP-based Server [Silva et al.]• Software aging in Linux [Cotroneo et al.]• Crash/hang failures in general purpose applications after a long
runtime
Copyright © 2013 by K.S. Trivedi
Measurements Showing Resource Exhaustion or Depletion
Real Memory Free File Table Size
A Methodology for Detection and Estimation of Software Aging,
S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi.
Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.
Copyright © 2013 by K.S. Trivedi
Software Fault Types & Their Mitigation
Copyright © 2013 by K.S. Trivedi
Software rejuvenation
Software rejuvenation is a cost effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes.
It allows proactive recovery to be carried either automatically or at the discretion of the user/administrator
Rejuvenation of the environment, not of software
Copyright © 2013 by K.S. Trivedi
Software Rejuvenation
Counteracts the software aging phenomenonFrees up OS resources; Removes error accumulation
Common techniques for cleaningGarbage collection, defragmentation, flushing kernel and file server tables etc.
Challenge: Rejuvenation scheduling/granularity
Copyright © 2013 by K.S. Trivedi
SW Rejuvenation: The Genesis
“Software Rejuvenation: Analysis, Module and Applications”, Y. Huang, C. Kintala, N. Koletis, N. Fulton, in FTCS 1995
An insight into operational software, that no-one had before (at least, formally). It changed
• How practitioners looked at making software more dependable
• Windfall of performance and dependability modeling problems for academicians
• Ideas to build better, real-world systems as Internet evolved
• Led to recognition of “software aging” phenomenon
• Brought about Phds, tenureships, publications, patents, awards, tools, systems, funding for many many people around the world.
Copyright © 2013 by K.S. Trivedi
Software RejuvenationExamples
AT&T billing applications [Huang et al.]
Patriot missile system software - switch off/on every 8 hours [Marshall]
On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]
IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]
Microsoft IIS 5.0 process recycling toolProcess restart in Apache [Li et al.]
ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS reports]
For more examples: "Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi, Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4rd International Workshop on Software Aging and Rejuvenation (WoSAR 2012) . Held in conjunction with The 23nd annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas, USA, 2012.
Copyright © 2013 by K.S. Trivedi
Software Rejuvenation –Trade-off
• Advantages– Reduces costs of sudden aging-related failures– Can be applied at the discretion of the user/administrator
• Disadvantages– Direct costs of carrying out rejuvenation– Opportunity costs of rejuvenation (downtime, decreased
performance, lost transactions etc)
Important research issue: Find optimal times to perform rejuvenation!
Copyright © 2013 by K.S. Trivedi
Software rejuvenation - Approaches
Two approaches based on WHEN:
Time-Based rejuvenation approaches
Rejuvenation applied regularly and at predetermined time intervals.
Widely used in real environments
– Web servers (Apache)– ISS two-months reboot– Telecommunication systems
Copyright © 2013 by K.S. Trivedi
Software rejuvenation - Approaches
Two approaches based on WHEN:
Measurement (Inspection)-based rejuvenation approaches
• Threshold based or predictive
• System metrics continually monitored
• Rejuvenation triggered when the crash is imminent based on the observation/prediction
• Reduce potentially useless rejuvenation actions and downtime in the process [Silva et al.]
Copyright © 2013 by K.S. Trivedi
Software rejuvenation
Copyright © 2013 by K.S. Trivedi
Software Rejuvenation – Approaches
Two approaches based on HOW:
Use analytical model to optimize rejuvenation schedule• Lucent Bell Labs [Huang et al., ‘95]
• Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]
• Others [IPDS’98, PNPM’99]
Copyright © 2013 by K.S. Trivedi
Software Rejuvenation – Approaches
Two approaches based on HOW:
Use measurements of resource degradation to determine/predict optimal rejuvenation schedule
• Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05]
• Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10]
Copyright © 2013 by K.S. Trivedi
Failure rate
Preventive maintenance is useful only if failure rate is increasing
If the time to failure distribution is exponential then failure rate is Constant
Need to assume (and establish) that TTF is IFR
Copyright © 2013 by K.S. Trivedi
Analytic Models
Single node models– CTMC model – SMP model
Cluster systems– IBM Cluster model (Time-based, condition-based)
Copyright © 2013 by K.S. Trivedi
Analytic ModelsSoftware Aging and Rejuvenation
A simple and useful model of increasing failure rate:Failure
probable state
Failed state
Time to failure: Hypo-exponential distribution
Increasing failure rate aging
Robust state
Copyright © 2013 by K.S. Trivedi
Analytic ModelsCTMC model [Huang95]
From this Continuous-time Markov chain modelCan find closed-form expression for the optimal rejuvenation trigger rate
(r4)
Model with rejuvenationModel w/o rejuvenation
Failed stater1
λ
r2
Failure probable state
Robust state
Failed state
Sf Sp
S0
λ
r2
r1 r3
r4
Robust state
Failure probable state
Rejuvenation
stateSf Sp
S0
Sr
Copyright © 2013 by K.S. Trivedi
Analytic ModelsSemi-Markov model [Dohi00]
Relax the assumption of exponentially distributed sojourn times (time-independent transition rates)
Hence have a semi Markov model
Can find closed-form expression for the optimal (deterministic) time to rejuvenation triggerSystem Failure
State change
Completion of Repair
Completion of Rejuvenation
Rejuvenation2 1
0
3
Copyright © 2013 by K.S. Trivedi
Rejuvenation in Cluster Systems
[Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself
Easier scaling to larger systems, high levels of availability/performance and low management costs
Cluster System
Copyright © 2013 by K.S. Trivedi
Rejuvenation for Cluster SystemsMotivation
Rejuvenation using the fail-over mechanisms
Long-terms benefits in terms of availability/performance
Continuous operation (possibly at a degraded level)
Practically zero downtime
Copyright © 2013 by K.S. Trivedi
Rejuvenation for Cluster SystemsMotivation
Less disruptive and lower overhead than unplanned outages
Transparent to user/application
Most current industry initiatives reactive
Two approachesSimple time-based (periodic)Condition-based (only from the “failure-impending” state
Copyright © 2013 by K.S. Trivedi
Rejuvenation for Cluster SystemsSRN Models
Rejuvenation using the fail-over mechanisms in a rolling fashion
Modeling using SRNs (Stochastic Reward Nets)
Analysis for 2 rejuvenation policies Simple time-based policy
• All nodes rejuvenated successively at the end of each rejuvenation intervalCondition-based policy
• Nodes rejuvenated only from the “failure-probable” state
Copyright © 2013 by K.S. Trivedi
SRN ModelBasic Cluster Model
Copyright © 2013 by K.S. Trivedi
SRN ModelSimple Time-Based Rejuvenation
Copyright © 2013 by K.S. Trivedi
Model Parameters
Transition Mean time
Tfprob 240 hours
Tnodefail 720 hours
Tnoderepair 30 mins
Tsysrepair 4 hours
Trejuv 10 mins
costnodefail $5000/hour
costnoderejuv $250/hour
Copyright © 2013 by K.S. Trivedi
Model Measures
Unavailability
(#Psysfail == 1) ? 1 : 0
Cost#Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail
Measures Computed
Copyright © 2013 by K.S. Trivedi
ResultsSimple Time-Based Rejuvenation
8/1 configuration 8/2 configuration
As rejuv. int. increases, rejuvenation is performed less frequently
When rejuv int is close to zero, the system is rejuvenating very frequently resulting in high cost/downtime
When rejuv. int. goes beyond optimal value, system failures become frequent
resulting in high cost/downtime
Copyright © 2013 by K.S. Trivedi
Measuring Performance Variables
ObjectiveDetection and validation of aging
Periodically monitor and collect data on the attributes responsible for the “health” of the system
Quantify the effect of aging on system resourcesProposed metric – Estimated time to exhaustionProposed metric – Evaluation function using PCA approach
Copyright © 2013 by K.S. Trivedi
Measuring Performance Variables
ApproachesTime-based (workload-independent) estimation [Garg98]
Workload-based estimation [Vaidyanathan99]
ARMA/ARX models [Li02]
ALT and ADT techniques [Matias06]
Non-parametric Algorithms [Dohi00]
Non-linear models [Hoffman07]
Principal component Analysis (PCA) and System identification [Jia]
Pattern recognition [Vaidyanathan & Gross]
Threshold-based approaches [Silva09]
Machine Learning Approaches [Alonso10]
Copyright © 2013 by K.S. Trivedi
97
Data CollectionExperimental Setup
SNMP-based resource monitoring tool:
Data related to OS resource usage (memory, process table, file table etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts etc. ) collected for over 3 months at 10 min intervals
Copyright © 2013 by K.S. Trivedi
98
Time PlotsNon-parametric Regression Smoothing
Real Memory Free File Table Size
Trend detection: Seasonal Kendall test for trend
Copyright © 2013 by K.S. Trivedi
IBM xSeries Software Rejuvenation Agent (SRA)
Implemented in a high-availability clustered environment
Monitors consumable resources, estimate time to exhaustion and generates alerts if within user notification horizon
Copyright © 2013 by K.S. Trivedi
IBM xSeries Software Rejuvenation Agent (SRA)
IBM Director system management tool– Provides GUI to configure SRA– Acts upon alerts
Two versions– Periodic rejuvenation– Prediction-based rejuvenation
Copyright © 2013 by K.S. Trivedi
Summary
It is possible to enhance software availability during operation exploiting environmental diversity
Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and if all else fails repair (patch)
Copyright © 2013 by K.S. Trivedi
Summary
Software aging not anecdotal – real life scientific phenomenon
Rejuvenation implemented in several special purpose applications and many general purpose cluster systems
Copyright © 2013 by K.S. Trivedi
Key References• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis
and N. Fulton, In Proc. FTCS-25, June 1995. • A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K.
Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998. • Performance and Reliability Evaluation of Passive Replication Schemes in Application Level
Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS 1999.
• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.
• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, March 2001.
• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE-TDSC, April-June 2005.
• Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, IEEE Trans. Reliability, Sept. 2006.
Copyright © 2013 by K.S. Trivedi
Key References
• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, Feb. 2007.
• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.
• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, MATIAS JR, R., TRIVEDI, K., Maciel, P. , ISSRE, 2010.
• Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias, Jr., K. S. Trivedi and Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010.
• An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN, 2010.
• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade. International Journal of System Assurance Engineering and Management, 2011.
• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim, M. Grottke, M. Nambiar , Proc. PRDC 2011
• O. Kyas. (2001). Network Troubleshooting, Palo Alto California, Agilent Technologies (book)• M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE
1996• R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based
Telecommunications System, ISSRE 1992