Software Faults, Failures and Their Mitigations

Kishor Trivedi
Duke High Availability Assurance Lab (DHAAL)
Dept. of Electrical & Computer Engineering
Duke University
Durham, NC 27708
[email protected]
www.ee.duke.edu/~kst

Persistent Systems, January 5, 2013

Copyright © 2013 by K.S. Trivedi

Software Faults, Failures and Their Mitigations | Turing100@Persistent


Prof Kishor Trivedi, Duke University, USA talks about Software Faults, Failures and Their Mitigations at the 6th Session of Turing100@Persistent


Duke University, Research Triangle Park (RTP)

[Map: Duke, UNC-CH, and NC State in North Carolina, USA]

Duke University

• U.S. News & World Report, in its 2012 edition, ranked the university's undergraduate program 8th among all universities in the USA

Duke University

NCAA Men’s Basketball Champions 2010 (also 1991, 1992 & 2001)

Trivedi’s research triangle

Theory: stochastic modeling methods & numerical solution methods
• Large fault trees, stochastic Petri nets
• Large/stiff Markov & non-Markov models
• Fluid stochastic models
• Performability & Markov reward models
• Software aging and rejuvenation
• Security, survivability, resilience quantification
• Machine learning

Applications: reliability/availability/performance
• Avionics (Boeing), space (NASA/JPL), power systems (GE), automobile systems (GM)
• Computer systems (EMC, SUN, HP, TCS)
• Telco systems (AT&T, Lucent, Avaya)
• Computer networks (Motorola)
• Virtualized data center (NEC)
• Cloud computing (IBM, NEC, Cisco)
• Software aging & rejuvenation (Huawei)

Software packages: HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT

Books: Blue, Red, White

Textbooks

• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, second edition, John Wiley, 2001 (Blue book) [first edition by Prentice-Hall, 1982]

• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer (now Springer), 1996 (Red book)

• Queuing Networks and Markov Chains, John Wiley, second edition, 2006 (White book)

DHAAL & Industry: A Success Story

• Reliability prediction of the Boeing 787 Current Return Network for FAA certification
• Security quantification (DARPA SITAR)/NSF
• Survivability quantification for Lucent POTS and Siemens Smartgrid
• Cloud performance, availability, power with IBM and Cisco
• Reliability/availability prediction of SIP protocol on IBM WebSphere
• Software aging and rejuvenation: state of the art, theory, measurements, and implementation (IBM x-series); Huawei
• Cloud computing security (measurements, quantification, and implementation): NATO Science for Peace and Security
• NASA-JPL failure data analytics
• NEC collaboration for performability management (VM allocation, VM and VMM rejuvenation, etc.) in virtualized data centers
• Data analytics (statistical and machine learning techniques used to interpret huge volumes of data): Wipro Technologies
• Software reliability/availability/performability analysis (short courses, seminars, and consulting): Tata Consultancy Services

Outline

• Motivation
• A Real System
• Software Fault Classification
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions

Pervasive Dependence on Computer Systems: Need for High Reliability/Availability

[Figure: application areas: Health & Medicine, Avionics, Entertainment, Banking, Communication]

Basic Definitions

• Steady-state availability ($A_{ss}$), or just availability: the long-term probability that the system is available when requested:

$$A_{ss} = \frac{MTTF}{MTTF + MTTR}$$

MTTF is the system mean time to failure, a complex combination of component MTTFs

MTTR is the system mean time to recovery; it may consist of many phases

Basic Definitions

• Downtime in minutes per year: (un)availability is usually presented in terms of annual downtime.

– Downtime = 8760 × 60 × (1 − $A_{ss}$) minutes.

– 5 NINES ($A_{ss}$ = 0.99999) → 5.26 minutes annual downtime
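The two definitions above can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the sample MTTF/MTTR values are assumptions made here, not part of the talk.

```python
MINUTES_PER_YEAR = 8760 * 60  # 525,600 minutes in a non-leap year


def steady_state_availability(mttf: float, mttr: float) -> float:
    """A_ss = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    # Five nines, as on the slide.
    print(round(annual_downtime_minutes(0.99999), 2))  # 5.26
    # Hypothetical system: fails every 999 hours, takes 1 hour to recover.
    print(steady_state_availability(999.0, 1.0))
```

The same helper makes it easy to see how sensitive downtime is to recovery time: halving MTTR roughly halves annual downtime when MTTF is much larger than MTTR.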

Number of Nines: Reality Check

• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week

– Approx. 80 hours/year = 4800 minutes/year

– $A_{ss}$ = (8760 − 80)/8760 ≈ 0.9909

– That is, between 2 NINES and 3 NINES!

• This study counts planned and unplanned downtime together
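To make the "number of nines" concrete, the reality check can be reproduced as follows. This is a minimal sketch; defining the number of nines as the negative base-10 logarithm of unavailability is a common convention assumed here, and the function names are illustrative.

```python
import math


def availability_from_downtime(hours_down_per_year: float,
                               hours_per_year: float = 8760.0) -> float:
    """Availability implied by a given amount of annual downtime."""
    return (hours_per_year - hours_down_per_year) / hours_per_year


def number_of_nines(availability: float) -> float:
    """-log10(unavailability): 0.99 gives 2, 0.999 gives 3, and so on."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    a = availability_from_downtime(80.0)
    print(f"availability = {a:.4f}, nines = {number_of_nines(a):.2f}")
```

With 80 hours of downtime per year the result lies just above two nines, which matches the slide's conclusion.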

Achieving High Availability is a Challenge

• Black September 2011, in the same week:
– Microsoft cloud service outage (2.5 hours)
– Google Docs service outage (1 hour)
  • A memory leak due to a software update

• Sept. 2012: GoDaddy (4 hours)
– 5 million websites affected

• Oct. 2012: Amazon
– 10/15/2012: Web Services, 6 hours (memory leak)
– 10/27/2012: EC2, more than 2 hours

Downtime Costs per Hour

• Brokerage operations: $6,450,000
• Credit card authorization: $2,600,000
• eBay (one 22-hour outage): $225,000
• Amazon.com: $180,000
• Package shipping services: $150,000
• Home shopping channel: $113,000
• Catalog sales center: $90,000
• Airline reservation center: $89,000
• Cellular service activation: $41,000
• On-line network fees: $25,000
• ATM service fees: $14,000

Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel, 2000, p. 8: “...based on a survey done by Contingency Planning Research.”

High Reliability/Availability

• Hardware fault tolerance, fault management, and reliability/availability modeling/assurance are relatively well developed

• System outages are more often due to software faults

• Key challenge: software reliability is one of the weakest links in system reliability/availability

Software is the problem

[Figure: charts for 1985 and 2005]

Across different industries…

Jim Gray’s paper “Why do computers stop and what can be done about it?” started to point out this trend in 1985, followed by his paper “A census of Tandem system availability between 1985 and 1990”

Increasing SW Failure Rate?
Planetary Missions Flight Software: A. Nikora of JPL

[Figure: chart by mission. Mission names (in launch order): Mars Global Surveyor, Pathfinder, CASSINI, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]

The interval between the first and last launch: 8.76 years.

The interval between successive launches ranges from 23 to 790 days.

Similar results for ground software

High Reliability/Availability: Software is the problem

• Fault avoidance through good software engineering practices is difficult for large/complex software systems

– Impossible to fully test and verify that software is fault-free: “Testing shows the presence, not the absence, of bugs” (E. W. Dijkstra)

• Yet there are stringent requirements for failure-free operation

High Reliability/Availability: Software is the problem (2)

Since fault-free software is virtually impossible, software fault tolerance is a potential way to improve software reliability

Software Fault Tolerance: Classical Techniques

Design diversity
– N-version programming
– Recovery block

Expensive; not used much in practice!

Yet there are stringent requirements for failure-free operation

Challenge: Affordable Software Fault Tolerance


High availability SIP Application Server Configuration on IBM WebSphere

[Figure: configuration diagram. Labels: Blade Chassis 1 and Blade Chassis 2, each with Blades 1 to 4; SIP Proxy 1 on Blade 1; AS 1 to AS 6 on Blades 2 to 4; Replication Domains 1 to 6; IP Sprayer (IBM Load Balancer); test drivers; DM; IBM PC]

AS1 thru AS6 are Application Servers

Proxy1's are Stateless Proxy Servers

PRDC 2008 and ISSRE 2010 papers

High availability SIP Application Server configuration on WebSphere

Hardware configuration:
– Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)

Software configuration:
– 2 copies of SIP/Proxy servers (1 sufficient for performance)
– 12 copies of WAS (6 sufficient for performance)
– Each WAS instance forms a redundancy pair (replication domain) with WAS installed on another node on a different chassis

• The system has hardware redundancy and software redundancy

High availability SIP Application Server configuration on WebSphere

Software Fault Tolerance
– Identical copies of SIP proxy used as backups (hot spares)
– Identical copies of WebSphere Application Server (WAS) used as backups (hot spares)
– Type of software redundancy: not design diversity, but replication of identical software copies
– Normal recovery
  • restart software, reboot node or fail over to a software replica; only when all else fails, a “software repair” is invoked

Escalated levels of Recovery: A real example (Avaya Servers and Media Gateways)

1. Single Process Restart (SPR)
2. If 3 SPRs within 60 seconds: System Warm Restart (SWR)
3. If 3 SWRs within 15 min: System Cold Restart (SCR)
4. If 3 SCRs within 15 min: Avaya Communication Manager Software Reload (ACMSR)
5. If 3 ACMSRs within 15 min: OS Reboot

The flowchart depicts the actions taken for recovery after a failure is detected. Try the simplest recovery method first, then a more complex one, and so on.
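The escalation logic in the flowchart can be sketched in a few lines of Python. This is one possible reading of the flowchart, not Avaya's implementation: the class name, the use of timestamps in seconds, and the choice to escalate on the failure that follows three recoveries inside the window are all assumptions made for illustration.

```python
from collections import deque

# (recovery action, recoveries tolerated, window in seconds), from the flowchart.
LEVELS = [
    ("SPR", 3, 60),         # Single Process Restart
    ("SWR", 3, 15 * 60),    # System Warm Restart
    ("SCR", 3, 15 * 60),    # System Cold Restart
    ("ACMSR", 3, 15 * 60),  # Communication Manager software reload
]
LAST_RESORT = "OS_REBOOT"


class EscalatingRecovery:
    """Pick the cheapest recovery action that has not been overused recently."""

    def __init__(self):
        # One queue of recent recovery timestamps per level.
        self.recent = [deque() for _ in LEVELS]

    def on_failure(self, now: float) -> str:
        for (action, limit, window), recent in zip(LEVELS, self.recent):
            # Forget recoveries that fell out of this level's time window.
            while recent and now - recent[0] > window:
                recent.popleft()
            if len(recent) < limit:
                recent.append(now)
                return action
            # This level was tried `limit` times within the window: escalate.
            recent.clear()
        return LAST_RESORT


if __name__ == "__main__":
    policy = EscalatingRecovery()
    for t in (0, 10, 20, 30):
        print(t, policy.on_failure(t))
```

Failures that are far apart in time keep being handled by a single process restart; only a burst of failures inside a window climbs to the more disruptive actions.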


Software Fault Tolerance: New Thinking

Retry, restart, reboot!

– Known to help in dealing with hardware transients

– Do they help in dealing with failures caused by software bugs?

– If yes, why?


A Cartoon

Why is this true………at least for computers?


Software Fault Tolerance: New Thinking

Failover to an identical software replica (that is not a diverse version)

– Does it help?

– If yes, why?

Twenty years ago this would be considered crazy!

Outline

• Motivation
• A Real System
• Software Fault Classification
– Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions

Software Faults: main threats to high reliability, availability & safety


IFIP Working Group 10.4 (Laprie)

• Failure occurs when the delivered service no longer complies with the desired output.

• Error is that part of the system state which is liable to lead to subsequent failure.

• Fault is adjudged or hypothesized cause of an error.

Faults are the cause of errors that may lead to failures: Fault → Error → Failure

Need to Classify bug types

• We submit that a software fault tolerance approach based on retry, restart, reboot or fail-over to an identical software replica (not a diverse version) works because a significant number of software failures are caused by Mandelbugs, as opposed to the traditional software bugs now called Bohrbugs


Need to Classify bug types

• In recent years, researchers have reported the phenomenon of “software aging” (i.e., degraded performance and/or increased failure rate of long-running software systems).

• Puzzle: How can performance and failure rate change if the software code is not modified?!

Study software fault types and their relationships


Jim Gray’s Definitions

• The terms “Bohrbug” and “Heisenbug” were first used in print by Jim Gray in 1985.

• “Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring.”

• “Most production software faults are soft. If the program state is reinitialized and the failed operation is retried, the operation will not fail a second time. … The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers.” (Gray, 1985)

[Photo: J. Gray]


Bruce Lindsay’s Definition

• “Heisenbugs as originally defined … are bugs in which clearly the system behavior is incorrect, and when you try to look to see why it’s incorrect, the problem goes away.” (Lindsay, 2004)

• The term alludes to the physicist Werner Heisenberg and his Uncertainty Principle.

Based on Gray’s paper, researchers have often equated Heisenbugs with soft faults. However, when Bruce Lindsay originally coined the term in the 1960s (while working with Jim Gray), he had a narrower definition in mind.

[Photo: B. Lindsay, photo by T. Upton]


Heisenbug – Our Definition

• Heisenbug := A fault that stops causing a failure or that manifests differently when one attempts to probe or isolate it.

• How can probing affect the bug?

1. Some debuggers initialize unused memory to default values, thus preventing failures due to improper initialization.

2. Trying to investigate a failure can influence process scheduling in such a way that a scheduling-related failure does not occur again.

A Classification of Software Faults

• Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity.

• Example: A bug causing a failure whenever the user enters a negative date of birth

• Since they are easily found, Bohrbugs tend to be detected and fixed during the software testing phase.

• The term alludes to the physicist Niels Bohr and his rather simple atomic model.

Mandelbug – Definition

• Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by it are not systematically reproducible.

• Example: A bug whose activation is scheduling-dependent

• The residual faults in a thoroughly tested piece of software are mainly Mandelbugs.

• The term alludes to the mathematician Benoît Mandelbrot and his research in fractal geometry.

Mandelbugs: Consequences

• Mandelbugs are difficult to detect and remove during the software testing phase.

• An operation that failed due to a Mandelbug may execute correctly upon retry even if the fault has not been removed; changing the environment may suffice.

• Potential recovery techniques:
– “Microreboot” of individual components
– Application restart
– System reboot
– Failover to a standby component (replicate)
– Manual recovery
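The retry idea can be illustrated with a small Python sketch. Everything here is hypothetical: the `TransientFailure` exception, the `retry` helper, and the simulated "stale lock" stand in for whatever environment-dependent condition activates a Mandelbug in a real system.

```python
class TransientFailure(Exception):
    """Raised when an operation fails because of its current environment."""


def retry(operation, attempts=3, reset_environment=None):
    """Re-run `operation`, optionally changing the environment between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFailure as err:
            last_error = err
            if reset_environment is not None:
                reset_environment()  # e.g. restart a component, free a resource
    raise last_error


# Simulated environment: the fault is only activated while a stale lock exists.
environment = {"stale_lock": True, "calls": 0}


def flaky_operation():
    environment["calls"] += 1
    if environment["stale_lock"]:
        raise TransientFailure("resource still locked")
    return "ok"


def clear_lock():
    environment["stale_lock"] = False


if __name__ == "__main__":
    print(retry(flaky_operation, reset_environment=clear_lock))
```

The code of `flaky_operation` is never changed, yet the second attempt succeeds because the environment differs. A Bohrbug, by contrast, would fail on every attempt and exhaust the retries.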

Examples of Types of Bugs in IT Systems

• Mandelbugs in IT systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.

• The projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sectors.


Aging-related Bug – Definition

• Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance.

Example: A bug causing memory leaks in the application

Note that the aging phenomenon requires a delay between fault activation and failure occurrence.

Note also that the software appears to age due to such a bug; there is no physical deterioration

Relationships

Bohrbug and Mandelbug are complementary antonyms.

Aging-related bugs are a subtype of Mandelbugs

[Figure: Venn diagram with regions labeled Bohrbugs, Mandelbugs, and Aging-Related Bugs]

Important Questions about these Bugs

• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs?
– How do these fractions vary
  • over time
  • over projects, languages, application types, …
– Need measurements
– Current NASA/JPL project with Allen Nikora & Michael Grottke; preliminary results from one NASA software project:
  • 52% Bohrbugs
  • 35% Mandelbugs (non-aging-related)
  • 4% Aging-related bugs
  • 7% Operator related
  • 2% Unclassified
– Very similar results for Linux, MySQL, Apache AXIS, httpd

• What are the methods of mitigation for the different fault types?

Trends in SW Fault Type Proportions: Planetary Missions Flight Software

• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions analyzed)

• Result: The proportion of Bohrbugs seems to settle at around the same value. Such a convergence to similar values is less obvious for the other fault types.


Environmental diversity: A new thinking to deal with software faults and failures

Software Fault Tolerance: New Thinking

New thinking: Environmental Diversity as opposed to Design Diversity

Our claim: since failures due to Mandelbugs are not negligible, retry, restart, reboot and fail-over give us an affordable software fault tolerance technique that we call Environmental Diversity

What is environmental diversity?

• The underlying idea of environmental diversity
– Retry a previously failed operation and it works
– Why?
– Because the environment in which the operation was executed has changed enough to avoid the fault activation.

• The environment is understood as
– OS resources, other applications running concurrently and sharing the same resources, interleaving of operations, concurrency, or synchronization.

What is environmental diversity?

• The execution of an application depends on the environment

[Figure: two stacks of hardware, operating system, and applications. Environment at time t1: Ap1, Ap2, Ap3, Ap4. After restarting the application, environment at time t1+n: Ap4, Ap6, Ap3, Ap5]

Environmental Diversity

• Restarting an application, rebooting a node, or failing over to an identical standby replica works because of the environmental diversity underlying these actions
– By environment here we mean the resources of the OS, other applications running concurrently and sharing system resources, interleaving of operations, concurrency, synchronization, etc.

• Environmental diversity uses time redundancy instead of expensive design diversity

• [Adams] Restart
• [Jalote et al.] Rollback, rollforward
• [Patriot] Occasional reboot, “switch off and on”
• [Avaya Swift] Restart process; failover to a replica
• [IBM SIP] Escalated levels: restart, reboot, failover…
• [IBM Director-X-series] Rejuvenation


Methods of Mitigation



Mitigation


Bohrbugs: Remove

Find and fix the bugs during testing

Failure data collected during testing

Calibrate a software reliability growth model (SRGM) using failure data; this model is then used for prediction

Many SRGMs exist (JM, NHPP, HGRGM, etc.); see the books by Lyu, Musa, and Cai

Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals of Software Engineering, 1999

Measurements Empirical (statistical) models
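As an illustration of calibrating an SRGM from failure data, the sketch below fits one specific NHPP model by maximum likelihood. The choice of the Goel-Okumoto model, with mean value function m(t) = a(1 − e^(−bt)), is an assumption made here for concreteness; the slide does not single out a model. The function names and the failure times in the usage example are made up.

```python
import math


def fit_goel_okumoto(failure_times, t_end):
    """Maximum-likelihood estimates (a, b) for m(t) = a * (1 - exp(-b * t)).

    failure_times: times of the observed failures, all in (0, t_end].
    t_end: total testing time.
    """
    n = len(failure_times)
    total = sum(failure_times)
    if n == 0 or total >= n * t_end / 2.0:
        raise ValueError("no reliability growth in the data; no finite MLE")

    def score(b):
        # Derivative of the log-likelihood in b after substituting
        # a = n / (1 - exp(-b * t_end)); it decreases as b grows.
        return n / b - total - n * t_end / math.expm1(b * t_end)

    lo, hi = 1e-6 / t_end, 1.0 / t_end
    while score(hi) > 0:  # widen the bracket until the root is enclosed
        hi *= 2.0
    for _ in range(200):  # plain bisection
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = (lo + hi) / 2.0
    a = n / -math.expm1(-b * t_end)
    return a, b


def expected_failures(a, b, t):
    """Mean value function m(t): expected cumulative failures by time t."""
    return a * -math.expm1(-b * t)


if __name__ == "__main__":
    times = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46]  # made-up failure times
    a, b = fit_goel_okumoto(times, t_end=50.0)
    print(f"a = {a:.2f}, b = {b:.4f}, estimated residual faults = {a - len(times):.2f}")
```

Under this model, `a` estimates the total number of faults that would eventually be found, so `a` minus the number of failures observed so far is a prediction of the faults still remaining.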


OS Availability Model (IBM BladeCenter)

[Figure: state-transition diagram of the OS availability model; state labels include UP, DN, DW, RP, and DT. Reboot handles a failure due to a Mandelbug; fix handles a failure due to a Bohrbug]

Availability model of a Proxy or a WAS (IBM SIP on WebSphere)

[Figure: state-transition diagram of the availability model; state labels include UA, UR, UB, RE, UO, UP, UN, 1D, 1N, and 2N]

Application server and proxy server

• Failure detection
– By WLM
– By Node Agent
– Manual detection

• Recovery
– Node Agent
  • Auto process restart
– Manual recovery
  • Process restart
  • Node reboot
  • Repair


Aging Related Bugs: Replicate, Restart, Reboot, Rejuvenate


Software Aging

• Aging phenomenon
– Error conditions accumulating over time

• Main causes of software aging
– Memory leak, fragmentation, unterminated threads, data corruption, round-off errors, unreleased file locks, etc.

• Observed systems
– OS, middleware, Netscape, Internet Explorer, etc.

Performance degradation, system failure


Software Aging - Definition

“Software Aging” phenomenon

Long-running software tends to show an increasing failure rate.

Not related to application program becoming obsolete due to changing requirements/maintenance.

Software appears to age; no real deterioration

Software Aging - Examples

• Cisco Catalyst Switch [Matias Jr.]
• File system aging [Smith & Seltzer]
• Gradual service degradation in the AT&T transaction processing system [Avritzer et al.]
• Error accumulation in Patriot missile system’s software [Marshall]
• Resource exhaustion in Apache [Li et al., Grottke et al.]
• Physical memory degradation in a SOAP-based server [Silva et al.]
• Software aging in Linux [Cotroneo et al.]
• Crash/hang failures in general-purpose applications after a long runtime

Measurements Showing Resource Exhaustion or Depletion

[Figure: two measurement plots, titled Real Memory Free and File Table Size]

A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi. Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.


Software Fault Types & Their Mitigation


Software rejuvenation

Software rejuvenation is a cost effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes.

It allows proactive recovery to be carried out either automatically or at the discretion of the user/administrator

Rejuvenation of the environment, not of software

Software Rejuvenation

• Counteracts the software aging phenomenon
– Frees up OS resources; removes error accumulation

• Common techniques for cleaning
– Garbage collection, defragmentation, flushing kernel and file server tables, etc.

• Challenge: rejuvenation scheduling/granularity

SW Rejuvenation: The Genesis

“Software Rejuvenation: Analysis, Module and Applications”, Y. Huang, C. Kintala, N. Kolettis, N. Fulton, in FTCS 1995

An insight into operational software that no one had before (at least, formally). It

• Changed how practitioners looked at making software more dependable

• Produced a windfall of performance and dependability modeling problems for academics

• Gave ideas for building better, real-world systems as the Internet evolved

• Led to recognition of the “software aging” phenomenon

• Brought about PhDs, tenure, publications, patents, awards, tools, systems, and funding for many people around the world.

Software Rejuvenation: Examples

AT&T billing applications [Huang et al.]

Patriot missile system software - switch off/on every 8 hours [Marshall]

On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]

IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]

Microsoft IIS 5.0 process recycling tool

Process restart in Apache [Li et al.]

ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS reports]

For more examples: "Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi, Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4th International Workshop on Software Aging and Rejuvenation (WoSAR 2012). Held in conjunction with the 23rd annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas, USA, 2012.

Software Rejuvenation – Trade-off

• Advantages
– Reduces costs of sudden aging-related failures
– Can be applied at the discretion of the user/administrator

• Disadvantages
– Direct costs of carrying out rejuvenation
– Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.)

Important research issue: Find optimal times to perform rejuvenation!
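The trade-off can be made concrete with a simple age-based policy: rejuvenate every tau time units unless a failure happens first. The sketch below is not one of the models from the papers cited in this talk; it assumes a Weibull time to failure, fixed downtimes for a failure and for a planned rejuvenation, and a standard renewal-reward argument. All names and numbers are illustrative.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)^shape); shape > 1 gives an increasing failure rate."""
    return math.exp(-((t / scale) ** shape))


def downtime_fraction(tau, shape, scale, d_fail, d_rejuv, steps=2000):
    """Long-run fraction of time the system is down when rejuvenating at age tau."""
    h = tau / steps
    # Expected uptime per cycle is the integral of R(t) over [0, tau] (trapezoid rule).
    uptime = sum(
        0.5 * h * (weibull_reliability(i * h, shape, scale)
                   + weibull_reliability((i + 1) * h, shape, scale))
        for i in range(steps)
    )
    survive = weibull_reliability(tau, shape, scale)
    # A cycle ends with a failure (long outage) or a rejuvenation (short outage).
    downtime = d_fail * (1.0 - survive) + d_rejuv * survive
    return downtime / (uptime + downtime)


def best_interval(candidates, shape, scale, d_fail, d_rejuv):
    """Candidate interval with the smallest long-run downtime fraction."""
    return min(candidates,
               key=lambda tau: downtime_fraction(tau, shape, scale, d_fail, d_rejuv))


if __name__ == "__main__":
    # Hypothetical numbers: times in hours, failure outage 5 h, rejuvenation 0.5 h.
    print(best_interval(range(10, 401, 10), shape=2.0, scale=100.0,
                        d_fail=5.0, d_rejuv=0.5))
```

With an increasing failure rate (shape 2) the minimum lies at an intermediate interval. With shape 1, the exponential case, the downtime fraction only falls as the interval grows, so under these assumptions rejuvenating never pays off.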

Page 70: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Software rejuvenation - Approaches

Two approaches based on WHEN:

Time-Based rejuvenation approaches

Rejuvenation applied regularly and at predetermined time intervals.

Widely used in real environments

– Web servers (Apache)
– ISS two-month reboot
– Telecommunication systems

Page 71: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Software rejuvenation - Approaches

Two approaches based on WHEN:

Measurement (Inspection)-based rejuvenation approaches

• Threshold based or predictive

• System metrics continually monitored

• Rejuvenation triggered when the crash is imminent based on the observation/prediction

• Reduce potentially useless rejuvenation actions and downtime in the process [Silva et al.]
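A threshold-based trigger of the kind listed above can be sketched in a few lines. This is only an illustration: the metric (free memory in MB), the threshold, and the rule of requiring several consecutive low samples are assumptions made here, not details taken from the cited work.

```python
from typing import Iterable


def should_rejuvenate(free_memory_mb: Iterable[float],
                      threshold_mb: float,
                      consecutive: int = 3) -> bool:
    """Return True once `consecutive` successive samples fall below the threshold.

    Requiring a run of low samples (a hypothetical debounce rule) avoids
    rejuvenating on a single transient dip in the monitored metric.
    """
    run = 0
    for sample in free_memory_mb:
        run = run + 1 if sample < threshold_mb else 0
        if run >= consecutive:
            return True
    return False


# Steady decline: three samples in a row below 100 MB trigger rejuvenation.
print(should_rejuvenate([500, 400, 90, 80, 70], threshold_mb=100))
# Isolated dips never form a run of three, so nothing is triggered.
print(should_rejuvenate([500, 90, 400, 80, 300], threshold_mb=100))
```

A predictive variant would replace the fixed threshold with a forecast of when the resource runs out.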

Page 72: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Software rejuvenation

Page 73: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Software Rejuvenation – Approaches

Two approaches based on HOW:

Use analytical model to optimize rejuvenation schedule

• Lucent Bell Labs [Huang et al., '95]

• Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]

• Others [IPDS’98, PNPM’99]

Page 74: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Software Rejuvenation – Approaches

Two approaches based on HOW:

Use measurements of resource degradation to determine/predict optimal rejuvenation schedule

• Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05]

• Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10]

Page 75: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Failure rate

Preventive maintenance is useful only if the failure rate is increasing

If the time-to-failure (TTF) distribution is exponential, then the failure rate is constant

Need to assume (and establish) that the TTF distribution has an increasing failure rate (IFR)
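The constant-rate claim follows directly from the definition of the failure (hazard) rate. The symbols are chosen here for illustration and do not appear on the slide: f is the density and F the CDF of the time to failure, and λ is the rate of the exponential distribution.

```latex
h(t) = \frac{f(t)}{1 - F(t)}, \qquad
\text{exponential: } f(t) = \lambda e^{-\lambda t},\quad 1 - F(t) = e^{-\lambda t}
\;\Longrightarrow\; h(t) = \lambda .
```

Because h(t) does not depend on t in the exponential case, a freshly restarted process is statistically no "younger" than one that has been running for a long time, so rejuvenation would only add downtime.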

Page 76: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Analytic Models

Single node models
– CTMC model
– SMP model

Cluster systems
– IBM Cluster model (time-based, condition-based)

Page 77: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Analytic Models - Software Aging and Rejuvenation

A simple and useful model of increasing failure rate:

[Figure: state diagram with a robust state, a failure probable state (reached through aging), and a failed state]

Time to failure: hypo-exponential distribution (increasing failure rate)
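A quick numeric check that a two-stage hypo-exponential time to failure does have an increasing failure rate. This is a sketch: the two stage rates (means of 240 and 720 hours) are illustrative values picked here, and the function names are hypothetical.

```python
import math

# Assumed stage rates (per hour): robust -> failure probable, then -> failed.
A = 1 / 240.0
B = 1 / 720.0


def survival(t: float) -> float:
    """P(TTF > t) for the sum of two independent exponential stages (A != B)."""
    return (B * math.exp(-A * t) - A * math.exp(-B * t)) / (B - A)


def density(t: float) -> float:
    """Probability density of the two-stage hypo-exponential TTF."""
    return A * B * (math.exp(-A * t) - math.exp(-B * t)) / (B - A)


def hazard(t: float) -> float:
    """Failure rate h(t) = f(t) / (1 - F(t))."""
    return density(t) / survival(t)


for t in (0, 100, 500, 2000):
    print(f"t = {t:5d} h   h(t) = {hazard(t):.6f} per hour")
```

The hazard starts at zero, because a failure needs both stages to complete, and climbs toward the smaller of the two stage rates. That rising rate is what makes preventive restarts worthwhile.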

Page 78: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Analytic Models - CTMC model [Huang95]

From this continuous-time Markov chain model, a closed-form expression can be found for the optimal rejuvenation trigger rate (r4)

[Figure: two state diagrams. Model w/o rejuvenation: robust state S0, failure probable state Sp, failed state Sf; transition rates r1, r2, λ. Model with rejuvenation: the same states plus rejuvenation state Sr; transition rates r1, r2, r3, r4, λ.]
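The steady-state probabilities of such a four-state chain can be written down from its balance equations. The arc assignment below is an assumption made for this sketch, since the diagram does not survive in the text: S0→Sp at rate r2, Sp→Sf at rate λ, Sf→S0 at rate r1, Sp→Sr at rate r4, and Sr→S0 at rate r3. The numeric rates are illustrative only.

```python
def steady_state(r1: float, r2: float, r3: float, r4: float, lam: float) -> dict:
    """Steady-state probabilities of the assumed 4-state rejuvenation CTMC.

    Balance equations under the assumed arcs:
      pi_p * (lam + r4) = pi_0 * r2
      pi_f * r1         = pi_p * lam
      pi_r * r3         = pi_p * r4
    """
    ratio_p = r2 / (lam + r4)          # pi_p / pi_0
    ratio_f = ratio_p * lam / r1       # pi_f / pi_0
    ratio_r = ratio_p * r4 / r3        # pi_r / pi_0
    pi_0 = 1.0 / (1.0 + ratio_p + ratio_f + ratio_r)
    return {"S0": pi_0, "Sp": pi_0 * ratio_p,
            "Sf": pi_0 * ratio_f, "Sr": pi_0 * ratio_r}


# Illustrative rates per hour (hypothetical values chosen for this sketch).
RATES = dict(r1=2.0, r2=1 / 240.0, r3=6.0, lam=1 / 720.0)

no_rejuvenation = steady_state(r4=0.0, **RATES)
with_rejuvenation = steady_state(r4=1 / 100.0, **RATES)
print("P(failed), no rejuvenation:  ", no_rejuvenation["Sf"])
print("P(failed), with rejuvenation:", with_rejuvenation["Sf"])
print("P(rejuvenating):             ", with_rejuvenation["Sr"])
```

Setting r4 to zero recovers the model without rejuvenation. Weighting the Sf and Sr probabilities by their respective costs gives the quantity that the trigger rate r4 trades off.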

Page 79: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Analytic Models - Semi-Markov model [Dohi00]

Relax the assumption of exponentially distributed sojourn times (time-independent transition rates)

Hence we have a semi-Markov model

Can find closed-form expression for the optimal (deterministic) time to rejuvenation trigger

[Figure: semi-Markov state diagram with states 0, 1, 2, 3; transitions labeled State change, System Failure, Rejuvenation, Completion of Repair, Completion of Rejuvenation]

Page 80: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Rejuvenation in Cluster Systems

Cluster system [Pfister]: a collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself

Easier scaling to larger systems, high levels of availability/performance and low management costs

Cluster System

Page 81: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Rejuvenation for Cluster Systems - Motivation

Rejuvenation using the fail-over mechanisms

Long-term benefits in terms of availability/performance

Continuous operation (possibly at a degraded level)

Practically zero downtime

Page 82: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Rejuvenation for Cluster Systems - Motivation

Less disruptive and lower overhead than unplanned outages

Transparent to user/application

Most current industry initiatives are reactive

Two approaches
– Simple time-based (periodic)
– Condition-based (only from the “failure-impending” state)

Page 83: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Rejuvenation for Cluster Systems - SRN Models

Rejuvenation using the fail-over mechanisms in a rolling fashion

Modeling using SRNs (Stochastic Reward Nets)

Analysis for 2 rejuvenation policies:

Simple time-based policy

• All nodes rejuvenated successively at the end of each rejuvenation interval

Condition-based policy

• Nodes rejuvenated only from the “failure-probable” state
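The difference between the two policies comes down to which nodes are selected when a rejuvenation round starts. The sketch below is a deliberately simplified illustration, not the SRN model itself: the state names, the function name, and the choice to skip nodes that are already failed are assumptions made here.

```python
from typing import List


def nodes_to_rejuvenate(states: List[str], policy: str) -> List[int]:
    """Return node indices to rejuvenate, in rolling (one-after-another) order.

    states: per-node state, one of "robust", "failure-probable", "failed"
            (hypothetical labels for this sketch).
    policy: "time-based" or "condition-based".
    """
    if policy == "time-based":
        # Every node that is up gets rejuvenated at the end of the interval.
        return [i for i, s in enumerate(states) if s != "failed"]
    if policy == "condition-based":
        # Only nodes that have already aged are rejuvenated.
        return [i for i, s in enumerate(states) if s == "failure-probable"]
    raise ValueError(f"unknown policy: {policy}")


cluster = ["robust", "failure-probable", "failed", "failure-probable"]
print(nodes_to_rejuvenate(cluster, "time-based"))
print(nodes_to_rejuvenate(cluster, "condition-based"))
```

Rejuvenating the returned nodes one at a time, with fail-over covering each in turn, keeps the cluster serving requests throughout the round.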

Page 84: Software Faults, Failures and Their Mitigations | Turing100@Persistent

SRN Model - Basic Cluster Model

Page 85: Software Faults, Failures and Their Mitigations | Turing100@Persistent

SRN Model - Simple Time-Based Rejuvenation

Page 86: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Model Parameters

Transition       Mean time
Tfprob           240 hours
Tnodefail        720 hours
Tnoderepair      30 mins
Tsysrepair       4 hours
Trejuv           10 mins

Cost parameter   Value
costnodefail     $5000/hour
costnoderejuv    $250/hour

Page 87: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Model Measures

Unavailability: (#Psysfail == 1) ? 1 : 0

Cost: #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail

Measures Computed
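The two reward functions can be evaluated on a single marking as follows, reading #P as the number of tokens in place P, which is the usual convention for stochastic reward nets. This is a sketch: the dictionary representation of a marking is an assumption, and the system-failure cost is a placeholder because no value for costsysfail is given in the parameter table.

```python
def unavailability_reward(marking: dict) -> int:
    """Reward 1 when the system-failure place holds exactly one token, else 0."""
    return 1 if marking.get("Psysfail", 0) == 1 else 0


def cost_reward(marking: dict, costs: dict) -> float:
    """Cost rate (dollars per hour) accrued while the net is in `marking`."""
    return (marking.get("Prejuv", 0) * costs["rejuv"]
            + marking.get("Pnodefail", 0) * costs["nodefail"]
            + marking.get("Psysfail", 0) * costs["sysfail"])


# Node-failure and rejuvenation costs follow the parameter table;
# the system-failure cost is a hypothetical placeholder value.
COSTS = {"rejuv": 250.0, "nodefail": 5000.0, "sysfail": 10000.0}

# Example marking: one node being rejuvenated, two failed nodes, system still up.
marking = {"Prejuv": 1, "Pnodefail": 2, "Psysfail": 0}
print(unavailability_reward(marking))
print(cost_reward(marking, COSTS))
```

The steady-state measure is then the sum of each marking's reward weighted by that marking's steady-state probability, a computation that an SRN solver carries out over the full reachability graph.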

Page 88: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Results - Simple Time-Based Rejuvenation

[Figure: two result plots, one for the 8/1 configuration and one for the 8/2 configuration]

As the rejuvenation interval increases, rejuvenation is performed less frequently

When the rejuvenation interval is close to zero, the system rejuvenates very frequently, resulting in high cost/downtime

When the rejuvenation interval goes beyond the optimal value, system failures become frequent, resulting in high cost/downtime
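The same U-shaped behaviour shows up in a much simpler single-node renewal model than the SRN used for these results. In the sketch below, everything is an assumption made for illustration: the node has a two-stage hypo-exponential time to failure (stage means of 240 and 720 hours), it is rejuvenated after a fixed interval if it has not failed, recovery from a failure takes 4 hours, and a rejuvenation takes 10 minutes.

```python
import math

A = 1 / 240.0        # assumed rate: robust -> failure probable (per hour)
B = 1 / 720.0        # assumed rate: failure probable -> failed (per hour)
T_REPAIR = 4.0       # assumed downtime after a failure (hours)
T_REJUV = 10 / 60.0  # assumed downtime of one rejuvenation (hours)


def survival(t: float) -> float:
    """P(no failure within t hours of a fresh start)."""
    return (B * math.exp(-A * t) - A * math.exp(-B * t)) / (B - A)


def mean_uptime(interval: float) -> float:
    """Expected uptime per cycle: integral of survival(t) over [0, interval]."""
    return (B / A * (1 - math.exp(-A * interval))
            - A / B * (1 - math.exp(-B * interval))) / (B - A)


def unavailability(interval: float) -> float:
    """Long-run fraction of time down (renewal-reward argument)."""
    alive = survival(interval)
    downtime = (1 - alive) * T_REPAIR + alive * T_REJUV
    return downtime / (mean_uptime(interval) + downtime)


best = min(range(10, 2001, 10), key=unavailability)
for interval in (1, best, 100000):
    print(f"interval = {interval:6d} h   unavailability = {unavailability(interval):.5f}")
```

Very short intervals are dominated by rejuvenation downtime and very long ones by repair downtime, with a minimum in between. With a shorter repair time, the minimum can disappear, in which case rejuvenation does not pay off under these assumptions.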

Page 89: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Measuring Performance Variables

Objective: detection and validation of aging

Periodically monitor and collect data on the attributes responsible for the “health” of the system

Quantify the effect of aging on system resources
– Proposed metric: Estimated time to exhaustion
– Proposed metric: Evaluation function using PCA approach
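An estimated time to exhaustion can be obtained by fitting a trend to a monitored resource and extrapolating to zero. The sketch below uses Sen's slope (the median of all pairwise slopes) as a robust trend estimate. Whether this matches the exact procedure behind the proposed metric is not stated on the slide, and the function names and sample data are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def sen_slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Median of the slopes between every pair of observations."""
    slopes = [(values[j] - values[i]) / (times[j] - times[i])
              for i in range(len(times))
              for j in range(i + 1, len(times))
              if times[j] != times[i]]
    return median(slopes)


def time_to_exhaustion(times: Sequence[float],
                       free: Sequence[float]) -> Optional[float]:
    """Time until the free resource reaches zero, extrapolating linearly
    from the latest sample; None if no downward trend is detected."""
    slope = sen_slope(times, free)
    if slope >= 0:
        return None
    return free[-1] / -slope


hours = [0, 1, 2, 3, 4]
free_mb = [100, 98, 50, 94, 92]   # one outlier sample at hour 2
print(sen_slope(hours, free_mb))
print(time_to_exhaustion(hours, free_mb))
```

The median makes the estimate insensitive to the outlier at hour 2, which would distort a least-squares fit on so few points.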

Page 90: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Measuring Performance Variables

Approaches

Time-based (workload-independent) estimation [Garg98]

Workload-based estimation [Vaidyanathan99]

ARMA/ARX models [Li02]

ALT and ADT techniques [Matias06]

Non-parametric Algorithms [Dohi00]

Non-linear models [Hoffman07]

Principal Component Analysis (PCA) and system identification [Jia]

Pattern recognition [Vaidyanathan & Gross]

Threshold-based approaches [Silva09]

Machine Learning Approaches [Alonso10]

Page 91: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Data Collection - Experimental Setup

SNMP-based resource monitoring tool:

Data related to OS resource usage (memory, process table, file table, etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts, etc.) collected for over 3 months at 10-minute intervals

Page 92: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Time Plots - Non-parametric Regression Smoothing

[Figure: smoothed time plots of two monitored resources, Real Memory Free and File Table Size]

Trend detection: Seasonal Kendall test for trend
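The idea behind a Kendall-type trend test can be shown with the plain Mann-Kendall statistic, which is a simplification of the seasonal test named above. The seasonal variant computes the same statistic within each season and combines the results. This sketch also omits the tie correction to the variance, and the sample data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mann_kendall(values: Sequence[float]) -> Tuple[int, float]:
    """Return (S, Z) for the non-seasonal Mann-Kendall trend test.

    S counts later-minus-earlier sign agreements; Z is the normal
    approximation with continuity correction (no tie correction).
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    variance = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    return s, z


free_memory = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # steadily shrinking resource
s_stat, z_stat = mann_kendall(free_memory)
print(s_stat, round(z_stat, 3))
```

A Z value below -1.96 rejects the no-trend hypothesis at the 5% level in a two-sided test, in favour of a downward trend in this example.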

Page 93: Software Faults, Failures and Their Mitigations | Turing100@Persistent

IBM xSeries Software Rejuvenation Agent (SRA)

Implemented in a high-availability clustered environment

Monitors consumable resources, estimates time to exhaustion, and generates alerts if exhaustion is predicted within the user notification horizon

Page 94: Software Faults, Failures and Their Mitigations | Turing100@Persistent

IBM xSeries Software Rejuvenation Agent (SRA)

IBM Director system management tool
– Provides GUI to configure SRA
– Acts upon alerts

Two versions
– Periodic rejuvenation
– Prediction-based rejuvenation

Page 95: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Summary

It is possible to enhance software availability during operation by exploiting environmental diversity

Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot, and, if all else fails, repair (patch)

Page 96: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Summary

Software aging is not anecdotal; it is a real-life scientific phenomenon

Rejuvenation has been implemented in several special-purpose applications and many general-purpose cluster systems

Page 97: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Key References

• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, In Proc. FTCS-25, June 1995.

• A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998.

• Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS 1999.

• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.

• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, March 2001.

• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE-TDSC, April-June 2005.

• Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, IEEE Trans. Reliability, Sept. 2006.

Page 98: Software Faults, Failures and Their Mitigations | Turing100@Persistent

Key References

• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, Feb. 2007.

• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.

• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias Jr., K. S. Trivedi and P. Maciel, Proc. ISSRE, 2010.

• Accelerated Degradation Tests Applied to Software Aging Experiments, R. Matias Jr., K. S. Trivedi, P. J. F. Filho and P. A. Barbetta, IEEE Transactions on Reliability, March 2010.

• An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN, 2010.

• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade. International Journal of System Assurance Engineering and Management, 2011.

• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim, M. Grottke, M. Nambiar , Proc. PRDC 2011

• O. Kyas (2001). Network Troubleshooting, Palo Alto, California, Agilent Technologies (book)

• M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE 1996

• R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based Telecommunications System, ISSRE 1992