Software Faults, Failures and Their Mitigations | Turing100@Persistent

Copyright © 2013 by K.S. Trivedi


Kishor TrivediDuke High Availability Assurance Lab (DHAAL)

Dept. of Electrical & Computer EngineeringDuke University

Durham, NC [email protected]

www.ee.duke.edu/~kst

Software Faults, Failures and Their Mitigations

Persistent SystemsJanuary 5, 2013

Copyright © 2013 by K.S. Trivedi2

Duke UniversityResearch Triangle Park (RTP)

UNC-CHUNC-CH

DukeDuke

NC state

NC state

USAUSA North Carolina

North Carolina


Duke University

• U.S. News & World Report in its 2012 edition, – ranked the university's undergraduate program

8th among all universities in USA

3


Duke University

4

NCAA Men’s Basketball Champions 2010 (also 1998,1999 & 2001)NCAA Men’s Basketball Champions 2010 (also 1998,1999 & 2001)


HARP (NASA), SAVE (IBM),IRAP (Boeing)SHARPE, SPNP, SREPT

TheoryTheory

ApplicationsApplications

Books: Blue, Red, White

Software Software PackagesPackages

Stochastic modeling methods & numerical solution methods: Large Fault trees, Stochastic Petri Nets,Large/stiff Markov & non-Markov modelsFluid stochastic modelsPerformability & Markov reward modelsSoftware aging and rejuvenationSecurity, Survivability, Resilience quantificationMachine Learning

Trivedi’s research triangle

5

Reliability/availability/performanceAvionics (Boeing), Space (NASA/JPL), Power systems (GE), Automobile systems (GM) Computer systems (EMC,SUN,HP,TCS)Telco systems (AT&T, Lucent, Avaya)Computer Networks (Motorola)Virtualized Data center (NEC)Cloud computing (IBM, NEC, Cisco)Software Aging & Rejuvenation (Huawei)


Textbooks

Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Second edition,

John Wiley, 2001 (Bluebook) [First edition by Prentice-Hall, 1982]

Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the

SHARPE Software Package, Kluwer (now Springer), 1996 (Redbook)

Queuing Networks and Markov Chains, John Wiley, second edition, 2006 (White book)

6


DHAAL & IndustryA Success Story

• Reliability Prediction of Boeing 787 Current Return Network for FAA Certification

• Security Quantification (DARPA SITAR)/NSF• Survivability Quantification for Lucent POTS and Siemens Smartgrid• Cloud performance, availability, power with IBM and Cisco• Reliability/Availability Prediction of SIP protocol on IBM WebSphere• Software Aging and Rejuvenation: state of the art, theory, measurements,

and implementation (IBM x-series); Huawei• Cloud computing security (Measurements, quantification, and

implementation) - NATO Science for Peace and Security• NASA-JPL Failures data analytics• NEC collaboration for Performability Management (VMs allocation, VMs

and VMM rejuvenation, etc) in Virtualized Data Center• Data analytics (statistical and machine learning techniques used to

interpret huge volume of data) - WiPro Technologies• Software Reliability/Availability/Performability analysis – Short courses,

seminars, and consulting - Tata consulting services

77


Outline

• Motivation • A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions


Pervasive Dependence on Computer SystemsNeed for High Reliability/Availability

Health & Medicine Avionics

Entertainment

Banking

Communication


Basic Definitions

• Steady-state availability (Ass) or just availability Long-term probability that the system is available when

requested:

MTTF is the system mean time to failure, a complex combination of component MTTFs

MTTR is the system mean time to recovery- may consist of many phases

MTTRMTTF

MTTF

ssA


Basic Definitions

• Downtime in minutes per year(un)availability is usually presented in terms of annual downtime.

– Downtime = 876060 (1- Ass) minutes.

– 5 NINES (Ass = 0.99999) 5.26 minutes annual downtime


Number of Nines– Reality Check

• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week

– Approx. 80 hours/year=4800 minutes/year

– Ass=(8760-80)/8760=0.9908

– That is, between 2 NINES and 3 NINES!

• This study assumes planned and unplanned downtime, together


Achieving High Availability is a Challenge

• Black Sept. 2011, In the same week!!!!:– Microsoft Cloud service outage (2.5 hours)– Google Docs service outage (1 hour)

• A memory leak due to a software update

• Sept. 2012 GoDaddy (4 hours)– 5 millions of websites affected

• Oct. 2012 Amazon – 10/15/2012 Webservices – 6 hours (Memory leak)– 10/27/2012 EC2 – > 2 hours


Downtown Costs per Hour

• Brokerage operations $6,450,000• Credit card authorization $2,600,000• eBay (1 outage 22 hours) $225,000• Amazon.com $180,000• Package shipping services $150,000• Home shopping channel $113,000• Catalog sales center $90,000• Airline reservation center $89,000• Cellular service activation $41,000• On-line network fees $25,000• ATM service fees $14,000

Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p.8. ”...based on a survey done by Contingency Planning Research."


High Reliability/Availability

• Hardware fault tolerance, fault management, reliability/availability modeling/assurance relatively well developed

• System outages more due to software faults

• Key Challenge:

Software reliability is one of the weakest links in system

reliability/availability


Software is the problem

161985

2005

Across different industries….

Jim Gray’s paper titled “Why do computers stop and what can be done about it?”started to pointed out this trend in 1985, followed by his paper “A census of tandem system availability between 1985 and 1990”


Increasing SW Failure Rate?Planetary Missions Flight Software: A. Nikora of JPL

Mission Name (in launch order)

Mars Pathfinder CASSINI Mars Mars Stardust Mars Genesis Mars Deep MarsGlobal Climate Polar Odyssey Exploration Impact Reconnaissance

Surveyor Orbiter Lander Rover Orbiter

The interval between the first and last launch: 8.76 years.

The interval between successive launches ranges from:23 to 790 days.

Similar results for ground software


High Reliability/Availability: Software is the problem

• Fault avoidance – good software engineering practices – difficult for large/complex software systems

– Impossible to fully test and verify if software is fault-free“Testing shows the presence, not the absence, of bugs”

- E. W. Dijkstra

• Yet there are stringent requirements for failure-free operation


High Reliability/Availability: Software is the problem (2)

Software fault tolerance is a potential solution to improve software reliability in lieu of virtually impossible fault-free software


Software Fault Tolerance Classical Techniques

Design diversity– N-version programming– Recovery block

Expensive not used much in practice!

Yet there are stringent requirements for failure-free operation

Challenge: Affordable Software Fault Tolerance


Outline

• Motivation• A Real System• Software Fault Classification• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions


High availability SIP Application Server Configuration on IBM WebSphere

Blade 4

IP Sprayer-IBM Load Balancer

-

SIP

IBM PC

Replication

Group 3

Blade 2

Blade 3

Blade 4

Blade 2

AS 1

AS 2

AS 3

AS 4

AS 5

AS 1

AS 4

AS 2

AS 5

AS 6

AS 3

AS 6

Blade 3

Replication Domain 1



SIP

Proxy 1

SIP Proxy 1

Blade 1

Blade 1




Blade Chassis 1

Blade Chassis 2

Blade 4

Test Driver

Test Driver

Test drivers

DM

AS1 thru AS6 are Application Server

Proxy1's are Stateless Proxy Server

PRDC 2008 and ISSRE 2010 papers


High availability SIP Application Server configuration on WebSphere

Hardware configuration: – Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis

sufficient for performance)

Software configuration:– 2 copies of SIP/Proxy servers (1 sufficient for performance)

– 12 copies of WAS (6 sufficient for performance)

– Each WAS instance forms a redundancy pair (replication domain) with WAS installed on another node on a different chassis

• The system has hardware redundancy and software redundancy


High availability SIP Application Server configuration on WebSphere

Software Fault Tolerance– Identical copies of SIP proxy used as backups (hot spares)– Identical copies of WebSphere Applications Server (WAS) used

as backups (hot spares)– Type of software redundancy – (not design diversity) but

replication of identical software copies– Normal recovery

• restart software, reboot node or fail-over to a software replica; only when all else fails, a “software repair” is invoked


Escalated levels of Recovery A real example Avaya Servers and Media Gateways

Single Process Restart (SPR)

IF 3 SPR within

60 seconds

NO

System Warm Restart (SWR)

YES

IF 3 SWR

within 15 min

NO System Cold Restart (SCR)

IF 3 SCR within 15 min

NO

Avaya Communication

Manager Software reloads

(ACMSR)

YES

YES

IF 3 ACMSR within 15 min

OS RebootYES

NO

The flowchart depicts the actions taken for recovery after a failure is detected. Try the simplest recovery method first, then a more complex etc.


Software Fault Tolerance: New Thinking

Retry, restart, reboot!

– Known to help in dealing with hardware transients

– Do they help in dealing with failures caused by software bugs?

– If yes, why?


A Cartoon

Why is this true………at least for computers?



Failover to an identical software replica (that is not a diverse version)

– Does it help?

– If yes, why?

Twenty years ago this would be considered crazy!


Outline

• Motivation• A Real System• Software Fault Classification

– Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007

• Environmental Diversity • Methods of Mitigation• Software Aging and Rejuvenation• Conclusions


Software Faults main threats to high reliability,

availability & safety


Copyright © 2013 by K.S. Trivedi31

IFIP Working Group 10.4 (Laprie)

• Failure occurs when the delivered service no longer complies with the desired output.

• Error is that part of the system state which is liable to lead to subsequent failure.

• Fault is adjudged or hypothesized cause of an error.

Faults are the cause of errors that may lead to failuresFault Error Failure


Need to Classify bug types

• We submit that a software fault tolerance approach based on retry, restart, reboot or fail-over to an identical software replica (not a diverse version) work because of a significant number of software failures are caused by Mandelbugs as opposed to the traditional software bugs now called Bohrbugs


Need to Classify bug types

• In recent years, researchers have reported the phenomenon of “software aging” (i.e., degraded performance and/or increased failure rate of long-running software systems).

• Puzzle: How can performance and failure rate change if the software code is not modified?!

Study software fault types and their relationships


Jim Gray’s Definitions

• The terms “Bohrbug” and “Heisenbug” were first used in print by Jim Gray in 1985.

• “Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring.”

• “Most production software faults are soft. If the program state is reinitializedand the failed operation is retried,

the operation will not fail a second time. … The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers.” (Gray, 1985)

J. Gray


Bruce Lindsay’s Definition

• “Heisenbugs as originally defined … are bugs in which clearly the system behavior is incorrect,

and when you try to look to see why it’s incorrect, the problem goes away.” (Lindsay, 2004)

• The term alludes to the physicist Werner Heisenberg and his Uncertainty Principle.

Based on Gray’s paper, researchers have often equated Heisenbugs with soft faults.

However, when Bruce Lindsay originally coined the term in the 1960s (while working with Jim Gray), he had a more narrow definition in mind.

B. Lindsay, photo by T. Upton


Heisenbug – Our Definition

• Heisenbug := A fault that stops causing a failure or that manifests differently when one attempts to probe or isolate it.

• How can probing affect the bug?1. Some debuggers initialize unused

memory to default values, thus preventing failures due to improper initialization.

2. Trying to investigate a failure can influence process scheduling in such a way that a scheduling-related failure does not occur again.


Example: A bug causing a failure

A Classification of Software Faults

• Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity.

whenever the user enters a negative date of birth Since they are easily found, Bohrbugs tend to be

detected and fixed during the software testing phase.

The term alludes to the physicist Niels Bohr and his rather simple atomic model.


Mandelbug – Definition

• Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by a it are not systematically reproducible.

• Example: A bug whose

activation is scheduling-dependent The residual faults in a thoroughly-tested

piece of software are mainly Mandelbugs. The term alludes to the mathematician Benoît

Mandelbrot and his research in fractal geometry.


41Mandelbugs: Consequences• Mandelbugs are difficult to detect and remove

during the software testing phase.• An operation that failed due to a Mandelbug may

execute correctly upon retry even if the fault has not been removed; changing the environment may suffice.

• Potential recovery techniques:– “Microreboot” of individual components– Application restart– System reboot– Failover to a standby component (replicate)– Manual recovery


Examples of Types of Bugs in IT System

• Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.

• The projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sector.

42


Aging-related Bug – Definition

• Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance.

Example: A bug causing memory leaks in the application

Note that the aging phenomenon requires a delay between fault activation and failure occurrence.

Note also that the software appears to age due to such a bug; there is no physical deterioration


Relationships

Bohrbug and Mandelbug are complementary antonyms.

Aging-related bugs are a subtype of Mandelbugs

Aging-Related Bugs

Bohrbugs

Mandelbugs

Aging

-

Related Bugs

Bohrbugs

Mandelbugs


Important Questions about these Bugs

• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs– How do these fractions vary

• over time• over projects, languages, application types,…

– Need Measurements – Current NASA/JPL Project with Allen Nikora & Michael Grottke; preliminary

results from one NASA software project: • 52% Bohrbugs• 35% Mandelbugs (non-aging-related)• 4% Aging-related bugs• 7% Operator related• 2% Unclassified

– Very similar results for Linux, MySQL, Apache AXIS, httpd• What are the methods of mitigation for the different fault types


Trends in SW Fault Type ProportionsPlanetary Missions Flight Software

• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions analyzed)

• Result: The proportion of Bohrbugs seems to settle at around the same value. Such a convergence to similar values is less obvious for the other fault types.


Outline



Environmental diversityA new thinking to deal with software faults and failures




New thinking: Environmental Diversity as opposed to Design Diversity

Our claim is that this works since failures due to Mandelbugs are not negligible, we have an affordable software fault tolerance technique that we call

Environmental Diversity


What is environmental diversity?

• The underlying idea of Environmental diversity – Retry a previously faulty operation and it works– Why?– because of the environment where the operation was

executed has changed enough to avoid the fault activation.

• The environment is understood as– OS resources, other applications running concurrently and

sharing the same resources, interleaving of operations, concurrency, or synchronization.


What is environmental diversity?

• The execution of an application depends on the environment

Hardware

Operating System

Ap1 Ap2 Ap3 Ap4

Hardware

Operating System

Ap4 Ap6 Ap3 Ap5

Restart the application

Environment at time t1Environment at time t1+n


Environmental Diversity• Restart an application, reboot a node or failover to an

identical standby replica work because of the environmental diversity that will be underlying these actions;– By environment here we mean the resources of the OS, other

applications running concurrently and sharing system resources, interleaving of operations, concurrency, synchronization etc.

• Environmental Diversity uses time redundancy over expensive design diversity

• [Adams] Restart• [Jalote et al.] Rollback, rollforward • [Patriot] Occasional reboot, “switch off and on”• [Avaya Swift] restart process; failover to a replica• [IBM SIP] escalated levels: restart, reboot, failover…• [IBM Director-X-series] Rejuvenation


Outline



Methods of Mitigation



Mitigation


Bohrbugs: Remove

Find and fix the bugs during testing

Failure data collected during testing

Calibrate a software reliability growth model (SRGM) using failure data; this model is then used for prediction

Many SRGMs exist (JM,NHPP,HGRGM, etc.) Books by Lyu, Musa, Cai

Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals of Software Engineering, 1999

Measurements Empirical (statistical) models


Mitigation


OS Availability Model (IBM BladeCenter)

UP DN DWlOS

mOS

RPasp

bOSbOS

(1-bOS)bOSDT

dOS

Reboot (Failure due to a Mandelbug)

Fix (Failed due to a

Bohrbug)


Availability model of a Proxy or a WAS (IBM SIP on websphere)

UA UR UB(1-r)rm

rrmqra

(1-q)ra

bbm

RE(1-b)bm

m

UOUPg ed2

1D

ed2dd1

(1-e)d2

UN

dm

1N

(1-d)d1

(1-e)d2

2N (1-d)d1

(1-e)d2

ed2 dd1

Application server and proxy server

• Failure detection

– By WLM

– By Node Agent

– Manual detection

• Recovery

– Node Agent• Auto process restart

– Manual recovery• Process restart• Node reboot• Repair


Outline



Aging Related Bugs: Replicate, Restart, Reboot, Rejuvenate



Software Aging

Aging phenomenonError conditions accumulating over time

Main causes of Software AgingMemory leak, fragmentation, Unterminated threads, Data corruption, Round-

off errors, Unreleased file-locks, etc

Observed systemOS, Middle-ware, Netscape, Internet Explorer etc

Performance degradation, system failurePerformance degradation, system failure


Software Aging - Definition

“Software Aging” phenomenon

Long-running software tends to show an increasing failure rate.

Not related to application program becoming obsolete due to changing requirements/maintenance.

Software appears to age; no real deterioration


Software Aging - Examples

• Cisco Catalyst Switch [Matias Jr.]• File system aging [Smith & Seltzer]• Gradual service degradation in the AT&T transaction processing

system [Avritzer et al.]• Error accumulation in Patriot missile system’s software [Marshall] • Resources exhaustion in Apache [Li et al., Grottke et al.]• Physical memory degradation in a SOAP-based Server [Silva et al.]• Software aging in Linux [Cotroneo et al.]• Crash/hang failures in general purpose applications after a long

runtime


Measurements Showing Resource Exhaustion or Depletion

Real Memory Free File Table Size

A Methodology for Detection and Estimation of Software Aging,

S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi.

Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.


Software Fault Types & Their Mitigation


Software rejuvenation

Software rejuvenation is a cost effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes.

It allows proactive recovery to be carried either automatically or at the discretion of the user/administrator

Rejuvenation of the environment, not of software


Software Rejuvenation

Counteracts the software aging phenomenonFrees up OS resources; Removes error accumulation

Common techniques for cleaningGarbage collection, defragmentation, flushing kernel and file server tables etc.

Challenge: Rejuvenation scheduling/granularity


SW Rejuvenation: The Genesis

“Software Rejuvenation: Analysis, Module and Applications”, Y. Huang, C. Kintala, N. Koletis, N. Fulton, in FTCS 1995

An insight into operational software, that no-one had before (at least, formally). It changed

• How practitioners looked at making software more dependable

• Windfall of performance and dependability modeling problems for academicians

• Ideas to build better, real-world systems as Internet evolved

• Led to recognition of “software aging” phenomenon

• Brought about Phds, tenureships, publications, patents, awards, tools, systems, funding for many many people around the world.


Software RejuvenationExamples

AT&T billing applications [Huang et al.]

Patriot missile system software - switch off/on every 8 hours [Marshall]

On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]

IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]

Microsoft IIS 5.0 process recycling toolProcess restart in Apache [Li et al.]

ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS reports]

For more examples: "Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi, Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4rd International Workshop on Software Aging and Rejuvenation (WoSAR 2012) . Held in conjunction with The 23nd annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas, USA, 2012.


Software Rejuvenation –Trade-off

• Advantages– Reduces costs of sudden aging-related failures– Can be applied at the discretion of the user/administrator

• Disadvantages– Direct costs of carrying out rejuvenation– Opportunity costs of rejuvenation (downtime, decreased

performance, lost transactions etc)

Important research issue: Find optimal times to perform rejuvenation!


Software rejuvenation - Approaches

Two approaches based on WHEN:

Time-Based rejuvenation approaches

Rejuvenation applied regularly and at predetermined time intervals.

Widely used in real environments

– Web servers (Apache)– ISS two-months reboot– Telecommunication systems


Software rejuvenation - Approaches

Two approaches based on WHEN:

Measurement (Inspection)-based rejuvenation approaches

• Threshold based or predictive

• System metrics continually monitored

• Rejuvenation triggered when the crash is imminent based on the observation/prediction

• Reduce potentially useless rejuvenation actions and downtime in the process [Silva et al.]


Software rejuvenation


Software Rejuvenation – Approaches

Two approaches based on HOW:

Use analytical model to optimize rejuvenation schedule• Lucent Bell Labs [Huang et al., ‘95]

• Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]

• Others [IPDS’98, PNPM’99]


Software Rejuvenation – Approaches

Two approaches based on HOW:

Use measurements of resource degradation to determine/predict optimal rejuvenation schedule

• Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05]

• Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10]


Failure rate

Preventive maintenance is useful only if failure rate is increasing

If the time to failure distribution is exponential then failure rate is Constant

Need to assume (and establish) that TTF is IFR


Analytic Models

Single node models– CTMC model – SMP model

Cluster systems– IBM Cluster model (Time-based, condition-based)


Analytic ModelsSoftware Aging and Rejuvenation

A simple and useful model of increasing failure rate:Failure

probable state

Failed state

Time to failure: Hypo-exponential distribution

Increasing failure rate aging

Robust state


Analytic ModelsCTMC model [Huang95]

From this Continuous-time Markov chain modelCan find closed-form expression for the optimal rejuvenation trigger rate

(r4)

Model with rejuvenationModel w/o rejuvenation

Failed stater1

λ

r2

Failure probable state

Robust state

Failed state

Sf Sp

S0

λ

r2

r1 r3

r4

Robust state

Failure probable state

Rejuvenation

stateSf Sp

S0

Sr


Analytic ModelsSemi-Markov model [Dohi00]

Relax the assumption of exponentially distributed sojourn times (time-independent transition rates)

Hence have a semi Markov model

Can find closed-form expression for the optimal (deterministic) time to rejuvenation triggerSystem Failure

State change

Completion of Repair

Completion of Rejuvenation

Rejuvenation2 1

0

3


Rejuvenation in Cluster Systems

[Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself

Easier scaling to larger systems, high levels of availability/performance and low management costs

Cluster System


Rejuvenation for Cluster SystemsMotivation

Rejuvenation using the fail-over mechanisms

Long-terms benefits in terms of availability/performance

Continuous operation (possibly at a degraded level)

Practically zero downtime


Rejuvenation for Cluster SystemsMotivation

Less disruptive and lower overhead than unplanned outages

Transparent to user/application

Most current industry initiatives reactive

Two approachesSimple time-based (periodic)Condition-based (only from the “failure-impending” state


Rejuvenation for Cluster SystemsSRN Models

Rejuvenation using the fail-over mechanisms in a rolling fashion

Modeling using SRNs (Stochastic Reward Nets)

Analysis for 2 rejuvenation policies Simple time-based policy

• All nodes rejuvenated successively at the end of each rejuvenation intervalCondition-based policy

• Nodes rejuvenated only from the “failure-probable” state


SRN ModelBasic Cluster Model


SRN ModelSimple Time-Based Rejuvenation


Model Parameters

Transition Mean time

Tfprob 240 hours

Tnodefail 720 hours

Tnoderepair 30 mins

Tsysrepair 4 hours

Trejuv 10 mins

costnodefail $5000/hour

costnoderejuv $250/hour


Model Measures

Unavailability

(#Psysfail == 1) ? 1 : 0

Cost#Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail

Measures Computed


ResultsSimple Time-Based Rejuvenation

8/1 configuration 8/2 configuration

As rejuv. int. increases, rejuvenation is performed less frequently

When rejuv int is close to zero, the system is rejuvenating very frequently resulting in high cost/downtime

When rejuv. int. goes beyond optimal value, system failures become frequent

resulting in high cost/downtime


Measuring Performance Variables

ObjectiveDetection and validation of aging

Periodically monitor and collect data on the attributes responsible for the “health” of the system

Quantify the effect of aging on system resourcesProposed metric – Estimated time to exhaustionProposed metric – Evaluation function using PCA approach


Measuring Performance Variables

ApproachesTime-based (workload-independent) estimation [Garg98]

Workload-based estimation [Vaidyanathan99]

ARMA/ARX models [Li02]

ALT and ADT techniques [Matias06]

Non-parametric Algorithms [Dohi00]

Non-linear models [Hoffman07]

Principal component Analysis (PCA) and System identification [Jia]

Pattern recognition [Vaidyanathan & Gross]

Threshold-based approaches [Silva09]

Machine Learning Approaches [Alonso10]


97

Data CollectionExperimental Setup

SNMP-based resource monitoring tool:

Data related to OS resource usage (memory, process table, file table etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts etc. ) collected for over 3 months at 10 min intervals


98

Time PlotsNon-parametric Regression Smoothing

Real Memory Free File Table Size

Trend detection: Seasonal Kendall test for trend


IBM xSeries Software Rejuvenation Agent (SRA)

Implemented in a high-availability clustered environment

Monitors consumable resources, estimate time to exhaustion and generates alerts if within user notification horizon


IBM xSeries Software Rejuvenation Agent (SRA)

IBM Director system management tool– Provides GUI to configure SRA– Acts upon alerts

Two versions– Periodic rejuvenation– Prediction-based rejuvenation


Summary

It is possible to enhance software availability during operation exploiting environmental diversity

Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and if all else fails repair (patch)


Summary

Software aging not anecdotal – real life scientific phenomenon

Rejuvenation implemented in several special purpose applications and many general purpose cluster systems


Key References• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis

and N. Fulton, In Proc. FTCS-25, June 1995. • A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K.

Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998. • Performance and Reliability Evaluation of Passive Replication Schemes in Application Level

Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS 1999.

• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.

• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, March 2001.

• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE-TDSC, April-June 2005.

• Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, IEEE Trans. Reliability, Sept. 2006.


Key References

• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, Feb. 2007.

• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.

• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, MATIAS JR, R., TRIVEDI, K., Maciel, P. , ISSRE, 2010.

• Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias, Jr., K. S. Trivedi and Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010.

• An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN, 2010.

• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade. International Journal of System Assurance Engineering and Management, 2011.

• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim, M. Grottke, M. Nambiar , Proc. PRDC 2011

• O. Kyas. (2001). Network Troubleshooting, Palo Alto California, Agilent Technologies (book)• M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE

1996• R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based

Telecommunications System, ISSRE 1992

Technology

Software Faults, Failures and Their Mitigations | Turing100@Persistent