Developing Dependable Systems by Maximizing Component Diversity and Fault Tolerance
Jeff Tian, Suku Nair, LiGuo Huang, Nasser Alaeddine and Michael Siok
Southern Methodist University
US/UK Workshop on Network-Centric Operation and Network Enabled Capability, Washington, D.C., July 24-25, 2008


Outline
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
  • Value Perspective
• Experimental Evaluation
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work


Overall Framework
• Systems made up of different components
• Many factors contribute to system dependability
• Our focus: Diversity of individual components
• Component strength/weakness/diversity:
  • Target: Different dependability attributes and sub-attributes
  • External reference: Operational profile (OP)
  • Internal assessment: Contributors to dependability
  • Value perspective: Relative importance and trade-off
• Maximize diversity => Maximize dependability
  • Combine strengths
  • Avoid/complement/tolerate flaws/weaknesses


Overall Framework (2)
• Diversity: Four Perspectives
  • Environmental perspective: Operational profile (OP)
  • Target perspective: Goal, requirement
  • Internal contributor perspective: Internal characteristics
  • Value perspective: Customer
• Achieving diversity and fault tolerance:
  • Component evaluation matrix per target per OP
  • Multidimensional evaluation/composition via DEA (Data Envelopment Analysis)
  • Internal contributor to dependability mapping
  • Value-based evaluation using single objective function


Terminology
• Quality and dependability are typically defined in terms of conformance to customer’s expectations and requirements
• Key concepts: defect, failure, fault, and error
• Dependability: the focus in this presentation
  • Key attributes: reliability, security, etc.
• Defect = some problem with the software, either with its external behavior or with its internal characteristics


Failure, Fault, Error
• IEEE STD 610.12 terms related to defect:
  • Failure: The inability of a system or component to perform its required functions within specified requirements
  • Fault: An incorrect step, process, or data definition in a computer program
  • Error: A human action that produces an incorrect result
• Errors may cause faults to be injected into the software
• Faults may cause failures when the software is executed


Reliability and Other Dependability Attributes
• Software reliability = the probability of failure-free operation of a program for a specified time under a specified set of operating conditions (Lyu, 1995; Musa et al., 1987)
• Estimated according to various models based on defect and time/input measurements
• Standard definitions for other dependability attributes, such as security, fault tolerance, availability, etc.
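As one concrete instance of an input-based model, the Nelson model estimates reliability as the fraction of sampled runs that complete without failure. The sketch below assumes the runs were sampled according to the operational profile; the function name and the run counts are illustrative and do not come from the presentation.

```python
def nelson_reliability(total_runs: int, failed_runs: int) -> float:
    """Nelson input-domain estimate: R = 1 - f / n."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    if not 0 <= failed_runs <= total_runs:
        raise ValueError("failed_runs must be between 0 and total_runs")
    return 1.0 - failed_runs / total_runs


# Hypothetical test campaign: 1000 runs, 25 of which failed.
estimate = nelson_reliability(1000, 25)
print(f"Estimated reliability: {estimate:.3f}")
```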


Diversity: Environmental Perspective
• Dependability defined for a specific environment
• Stationary vs. dynamic usage environments
  • Static, uniform, or stationary (reached an equilibrium)
  • Dynamic, changing, evolving, with possible unanticipated changes or disturbances
• Single/overall OP for former category
  • Musa or Markov variation
  • Single evaluation result possible per component per dependability attribute: e.g., component reliability R(i)
• Environment Profiling for Individual Components
  • Environmental snapshots captured in Musa or Markov OPs
  • Evaluation matrix (later)


Operational Profile (OP)

• An operational profile (OP) is a list of disjoint operations and their associated probabilities of occurrence (Musa 1998)
• OP describes how users use an application:
  • Helps guide the allocation of test cases in accordance with use
  • Ensures that the most frequent operations will receive more testing
  • Serves as the context for realistic reliability evaluation
  • Other usages, including diversity and internal-external mapping in this presentation


Markov Chain Usage Model

• A Markov chain usage model is a set of states, transitions, and the transition probabilities
  • An alternative to the Musa (flat) OP
  • Each link has an associated probability of occurrence
  • Models complex and/or interactive systems better
• Unified Markov Models (Kallepalli and Tian, 2001; Tian et al., 2003):
  • Collection of Markov OPs in a hierarchy
  • Flexible application in testing and reliability improvement
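A Markov chain usage model can be sketched as a table of states with outgoing transition probabilities, from which usage sessions (and hence test sequences) are drawn by a random walk. The states and probabilities below are hypothetical and only illustrate the structure; they are not taken from the cited Unified Markov Models work.

```python
import random

# Hypothetical usage model: state -> {next state: transition probability}.
USAGE_MODEL = {
    "Start": {"Login": 1.0},
    "Login": {"Browse": 0.7, "Order status": 0.3},
    "Browse": {"New order": 0.4, "Browse": 0.3, "Exit": 0.3},
    "New order": {"Order status": 0.5, "Exit": 0.5},
    "Order status": {"Browse": 0.2, "Exit": 0.8},
}


def check_model(model):
    """Every state's outgoing probabilities must sum to 1."""
    for state, links in model.items():
        if abs(sum(links.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities for {state!r} do not sum to 1")


def generate_session(model, rng, start="Start", end="Exit", max_steps=100):
    """Random walk from start until the end state (or max_steps) is reached."""
    path = [start]
    while path[-1] != end and len(path) < max_steps:
        links = model[path[-1]]
        next_state = rng.choices(list(links), weights=list(links.values()))[0]
        path.append(next_state)
    return path


check_model(USAGE_MODEL)
print(" -> ".join(generate_session(USAGE_MODEL, random.Random(0))))
```

Sessions generated this way visit frequently used links more often, which is the property that makes the model usable for usage-based testing.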


Operational Profile Development: Standard Procedure
• Musa’s steps (1998) for OP construction:
  • Identify the initiators of operations
  • Choose a representation (tabular or graphical)
  • Create an operations “list”
  • Establish the occurrence rates of the individual operations
  • Establish the occurrence probabilities
• Other variations
  • Original Musa (1993): 5 top-down refinement steps
  • Markov OP (Tian et al): FSM then probabilities based on log files
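The last two construction steps amount to counting how often each operation occurs and normalizing the counts into probabilities. A minimal sketch, assuming the operations have already been extracted from a log into a flat list; the log contents and the function name are hypothetical.

```python
from collections import Counter


def op_from_log(operations):
    """Turn observed operations into occurrence probabilities."""
    counts = Counter(operations)          # occurrence rates
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no operations observed")
    return {name: n / total for name, n in counts.items()}


# Hypothetical extract of operations recovered from a log file.
log = ["Order status", "Change order", "Order status", "New order"]
for name, probability in op_from_log(log).items():
    print(f"{name}: {probability:.2f}")
```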


OPs for Composite Systems

• Using standard procedure whenever possible
  • For overall stationary environment
  • For individual component usage => component OP
  • For dynamic environment:
    • Snapshot identification
    • Sets of OPs for each snapshot
    • System OP from individual component OPs
• Special considerations:
  • Existing test data or operational logs can be used to develop component OPs
  • Union of component OPs => system OP


OP and Dependability Evaluation

• Some dependability attributes defined with respect to a specific OP: e.g., reliability
  • For overall stationary environment: direct measurement and assessment possible
  • For dynamic environment: OP-reliability pairs
  • Consequence of improper reuse due to different OPs (Weyuker 1998)
• From component to system dependability:
  • Customization/selection of best-fit OP for estimation
  • Compositional approach (Hamlet et al, 2001)


Diversity: Target Perspective
• Component Dependability:
  • Component reliability, security, etc. to be scored/evaluated
  • Direct Measurement and Assessment
  • Indirect Assessment (later)
• Under stationary environment:
  • Dependability vector for each component
  • Diversity maximization via DEA (data envelopment analysis)
• Under dynamic environment:
  • Dependability matrix for each component
  • Diversity maximization via extended DEA by flattening out the matrix


Diversity Maximization via DEA
• DEA (data envelopment analysis):
  • Non-parametric analysis
  • Establishes a multivariate frontier in a dataset
  • Basis: linear programming
• Applying DEA
  • Dependability attribute frontier
  • Illustrative example (right)
  • N-dimensional: hyperplane
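The idea of a frontier over dependability vectors can be illustrated with a plain dominance check. This is not DEA itself, which solves a linear program per component; it is a coarser first cut that keeps the components no other component matches or beats on every attribute. A dominated component cannot be fully efficient, so it would not define the frontier. The component names and scores are hypothetical.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(components):
    """Names of components that no other component dominates."""
    return sorted(
        name
        for name, vector in components.items()
        if not any(
            dominates(other, vector)
            for other_name, other in components.items()
            if other_name != name
        )
    )


# Hypothetical (reliability, security) scores; higher is better.
components = {
    "C1": (0.99, 0.60),
    "C2": (0.95, 0.80),
    "C3": (0.90, 0.95),
    "C4": (0.93, 0.70),  # worse than C2 on both attributes
}
print(non_dominated(components))
```

C1, C2, and C3 each lead on a different mix of attributes, which is the kind of diversity the frontier is meant to expose.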


DEA Example

• Lockheed-Martin software project performance with regard to selected metrics and production efficiency model
• Measures efficiencies of decision making units (DMU) using weighted sums of inputs and weighted sums of outputs
• Compares DMUs to each other
• Sensitivity analysis affords study of non-efficient DMUs in comparison
• BCC VRS Model used in initial study

Inputs:
  • Labor hours
  • Software Change Size
Outputs:
  • Software Reliability At Release
  • Defect Density after test
  • Software Productivity
Efficiency = Output/Input


DEA Example (2)
• Using production efficiency model for Compute-Intensive dataset group
• Ranked set of projects
• Data showing distance and direction from efficiency frontier

Rank  DMU  Score
1     34   1
1     30   1
1     26   1
1     22   1
1     10   1
1     13   1
1     15   1
8     14   0.94409
9     4    0.831527
10    37   0.805333
11    1    0.405722
12    7    0.256705
13    18   0.210479

DMU  Score     I/O                              Data      Projection  Difference  %
1    0.405722  Chng_Size_code                   196493.5  79721.8     -116772     -59.43%
               Total_Labor                      48800.03  19799.26    -29000.8    -59.43%
               DD_After_test_MESLOC             59.96817  96.5992     36.63103    61.08%
               ESLOC_per_labor_mo               4.026504  7.851672    3.825168    95.00%
               Weighted_Reliability_at_Release  22.83505  46.10035    23.2653     101.88%
4    0.831527  Chng_Size_code                   179734.6  149454.2    -30280.4    -16.85%
               Total_Labor                      12400.21  10311.11    -2089.1     -16.85%
               DD_After_test_MESLOC             47.08071  47.08071    0           0.00%
               ESLOC_per_labor_mo               14.49448  15.63405    1.13957     7.86%
               Weighted_Reliability_at_Release  27.33631  49.03719    21.70089    79.38%
7    0.256705  Chng_Size_code                   416797.6  106994      -309804     -74.33%
               Total_Labor                      66587.41  17093.33    -49494.1    -74.33%
               DD_After_test_MESLOC             97.9607   97.9607     0           0.00%
               ESLOC_per_labor_mo               6.259405  10.18545    3.926048    62.72%
               Weighted_Reliability_at_Release  15.05019  49.30659    34.25639    227.61%
10   1         Chng_Size_code                   330386.7  330386.7    0           0.00%
               Total_Labor                      17136.34  17136.34    0           0.00%
               DD_After_test_MESLOC             67.15824  67.15824    0           0.00%
               ESLOC_per_labor_mo               19.27988  19.27988    0           0.00%
               Weighted_Reliability_at_Release  12.08211  12.08211    0           0.00%
13   1         Chng_Size_code                   132123.2  132123.2    0           0.00%
               Total_Labor                      10384     10384       0           0.00%
               DD_After_test_MESLOC             13.12492  13.12492    0           0.00%
               ESLOC_per_labor_mo               12.72373  12.72373    0           0.00%
               Weighted_Reliability_at_Release  109.6671  109.6671    0           0.00%
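For the two inputs (Chng_Size_code and Total_Labor), the tabulated projections are consistent with scaling the observed value by the DMU's efficiency score, so the percentage difference equals the score minus one. The sketch below checks this for DMU 1; the function name is hypothetical, and small deviations from the table come from the score being rounded to six decimals. Output projections are not a simple multiple of the score and are not reproduced here.

```python
def input_projection(observed: float, score: float):
    """Scale an input by the efficiency score (radial contraction)."""
    projected = observed * score
    difference = projected - observed
    percent = (score - 1.0) * 100.0
    return projected, difference, percent


# DMU 1 from the table: score 0.405722, inputs Chng_Size_code and Total_Labor.
for label, observed in [("Chng_Size_code", 196493.5), ("Total_Labor", 48800.03)]:
    projected, difference, percent = input_projection(observed, 0.405722)
    print(f"{label}: {projected:.1f} ({difference:.1f}, {percent:.2f}%)")
```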


Diversity: Internal Perspective
• Component Dependability:
  • Direct Measurement and Assessment: might not be available, feasible, or cost-effective
  • Indirect Assessment via Internal Contributor Mapping
• Internal Contributors:
  • System design, architecture
  • Component internal characteristics: size, complexity, etc.
  • Process/people/other characteristics
  • Usually more readily available data/measurements
• Internal=>External mapping
  • Procedure with OP as input too (e.g., fault=>reliability)


Example: Fault-Failure Mapping for Dynamic Web Applications

[Flow diagram: inputs and analysis steps]

Inputs:
  • Web server logs
  • Defect data from defect tracking tool
  • Application operational profile
  • Defect impact scheme

Steps:
  • Step 1: Classification of defect information
  • Step 2: Classification of HTTP responses
  • Step 3: Top HTTP faults
  • Step 4: Number of hits with successful response code
  • Step 5: Number of transactions
  • Step 6: Top faults from defect data
  • Step 7: Top list


Web Example: Fault-Failure Mapping
• Input to analysis (and fault-failure conversion):
  • Anomalies recorded in web server logs (failure view)
  • Faults recorded during development and maintenance
  • Defect impact scheme (weights)
  • Operational profile
• Product “A” is an ordering web application for telecom services
  • Consists of hundreds of thousands of lines of code
  • Runs on IIS 6.0 (Microsoft Internet Information Services)
  • Processes a couple of million requests per day


Web Example: Fault-Failure Mapping (Step 1)

[Figure: Pareto chart titled “Defect Data classes”; x-axis: error class; y-axis: error percentage, 0% to 30%]

• Pareto chart for the defect classification of product “A”
• The top three categories represent 66.26% of the total defect data


Web Example: Fault-Failure Mapping (Steps 4 & 5)

Number of hits with response code 200 and 300    235142
Average number of hits per transaction           40
Number of transactions                           5880

Operation      Operation probability  Number of transactions
New order      0.1                    588
Change order   0.35                   2058
Move order     0.1                    588
Order status   0.45                   2646

• OP for product “A” and the corresponding numbers of transactions.
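The figures for Steps 4 and 5 can be reproduced with a few lines of arithmetic. Dividing 235142 hits by 40 hits per transaction gives 5878.55; the slide reports 5880, which is consistent with rounding to the nearest ten, although that rounding rule is an assumption here. The variable and function names are illustrative.

```python
HITS = 235142            # hits with response code 200 and 300
HITS_PER_TRANSACTION = 40

# Operational profile for product "A" as given on the slide.
OP = {
    "New order": 0.1,
    "Change order": 0.35,
    "Move order": 0.1,
    "Order status": 0.45,
}

# Assumption: the slide's 5880 comes from rounding 5878.55 to the nearest ten.
total_transactions = int(round(HITS / HITS_PER_TRANSACTION, -1))


def transactions_per_operation(op, total):
    """Split the total number of transactions according to the OP."""
    return {name: round(p * total) for name, p in op.items()}


print(total_transactions)
print(transactions_per_operation(OP, total_transactions))
```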


Web Example: Fault-Failure Mapping (Step 6)

Application aspect  Impact       Weight  Number of transactions  Failure frequency
Order status        Showstopper  100%    2646                    2646
Order status        High         70%     2646                    1852
Order status        Medium       50%     2646                    1323
Order status        Low          20%     2646                    529
Order status        Exception    5%      2646                    132

• Using the number of transactions calculated from the OP and the defined fault impact scheme, we calculated the fault exposure, or corresponding potential failure frequencies
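The failure frequencies in the table follow from multiplying the number of transactions for an operation by the impact weight and rounding to a whole number. A minimal sketch using the weights and the "Order status" transaction count from the slide; the function name is hypothetical.

```python
# Impact weights from the defect impact scheme, in percent.
IMPACT_WEIGHTS = {
    "Showstopper": 100,
    "High": 70,
    "Medium": 50,
    "Low": 20,
    "Exception": 5,
}


def failure_frequency(transactions: int, weight_percent: int) -> int:
    """Potential failures = transactions exposed to the fault x impact weight."""
    return round(transactions * weight_percent / 100)


order_status_transactions = 2646
for impact, weight in IMPACT_WEIGHTS.items():
    print(impact, failure_frequency(order_status_transactions, weight))
```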


Web Example: Fault-Failure Mapping (Step 7)

Rank  Response code  Fault                                  Failure frequency
1     404            /images/dottedsep.gif                  5805
2     404            /images/gnav_redbar_s_r.gif            3687
3     404            /images/gnav_redbar_s_l.gif            3537
4     200/300        Order status – showstopper             2646
5     404            /includes/css/images/background.gif    2593
6     200/300        Change order – showstopper             2058
7     200/300        Order status – high                    1852
8     200/300        Change order – high                    1441
9     200/300        Order status – medium                  1323
10    200/300        Change order – medium                  1029
11    404            /includes/css/nc2004style.css          721
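The top list merges the two fault sources, the 404 faults found in the web server logs and the defect-derived faults weighted by impact, and sorts them by failure frequency. The sketch below reproduces the ranking from the values in the table; the tuple layout and variable names are illustrative.

```python
# (response code, fault, failure frequency), values taken from the table.
log_faults = [
    ("404", "/images/dottedsep.gif", 5805),
    ("404", "/images/gnav_redbar_s_r.gif", 3687),
    ("404", "/images/gnav_redbar_s_l.gif", 3537),
    ("404", "/includes/css/images/background.gif", 2593),
    ("404", "/includes/css/nc2004style.css", 721),
]
defect_faults = [
    ("200/300", "Order status - showstopper", 2646),
    ("200/300", "Change order - showstopper", 2058),
    ("200/300", "Order status - high", 1852),
    ("200/300", "Change order - high", 1441),
    ("200/300", "Order status - medium", 1323),
    ("200/300", "Change order - medium", 1029),
]

# Merge both views and rank by failure frequency, highest first.
ranked = sorted(log_faults + defect_faults, key=lambda fault: fault[2], reverse=True)
for rank, (code, fault, frequency) in enumerate(ranked, start=1):
    print(rank, code, fault, frequency)
```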


Web Example: Fault-Failure Mapping (Result Analysis)
• A large number of failures were caused by a small number of faults with high usage frequencies
• Fixing faults with a high usage frequency and a high impact could achieve better efficiency in reliability improvement
• By fixing the top 6.8% of faults, the total failures were reduced by about 57%
• Similarly, fixing the top 10%, 15%, and 20% of faults reduced failures by about 66%, 71%, and 75%, respectively
• Failures recorded in the defect data repository and in the web server logs have insignificant overlap => both are needed for effective reliability improvement


Diversity: Value Perspective
• Component Dependability Attribute:
  • Direct Measurement and Assessment: might not capture what customers truly care about
  • Different value attached to different dependability attributes
• Value-based software quality analysis:
  • Quantitative model for software dependability ROI analysis
  • Avoid one-size-fits-all
• Value-based process: experience at NASA/USC (Huang and Boehm) extended to dependability
• Mapping to value-based perspective more meaningful to target customers


Value Maximization
• Single objective function:
  • Relative importance
  • Trade-off possible
  • Quantification scheme
  • Gradient scale to select component(s)
• Compare to DEA
  • General cases
  • Combination with DEA
  • Diversity as a separate dimension possible
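One simple form a single objective function can take is a weighted sum of dependability scores, with the weights expressing the relative importance a customer attaches to each attribute. The slide does not specify the function, so the weighted sum, the attribute names, and all numbers below are assumptions made for illustration.

```python
def value_score(scores, weights):
    """Weighted sum of attribute scores; weights encode relative importance."""
    return sum(weights[attribute] * scores[attribute] for attribute in weights)


def best_component(components, weights):
    """Component with the highest value under the given weights."""
    return max(components, key=lambda name: value_score(components[name], weights))


# Hypothetical component scores on a 0..1 scale.
components = {
    "C1": {"reliability": 0.99, "security": 0.60, "availability": 0.90},
    "C2": {"reliability": 0.95, "security": 0.80, "availability": 0.85},
    "C3": {"reliability": 0.90, "security": 0.95, "availability": 0.80},
}

# Two hypothetical customers with different priorities.
balanced = {"reliability": 0.5, "security": 0.3, "availability": 0.2}
reliability_first = {"reliability": 0.8, "security": 0.1, "availability": 0.1}

print(best_component(components, balanced))
print(best_component(components, reliability_first))
```

Changing the weights changes the selected component, which is the trade-off the value perspective is meant to make explicit.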


Experimental Evaluation
• Testbed
  • Basis: OPs
  • Focus on problems and system behavior under injected or simulated problems
• Fault Injection for Reliability and Fault Tolerance
  • Reliability mapping for injected faults
  • Use of fault seeding models
  • Direct fault tolerance evaluation
• Security Threat Simulation
  • Focus 1: likely scenarios
  • Focus 2: coverage via diversity
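The slide does not say which fault seeding model is intended. As an illustration, the classic Mills estimate assumes seeded and indigenous faults are equally likely to be detected, so the fraction of seeded faults found is used to scale up the number of indigenous faults found. The function name and the numbers are hypothetical.

```python
def estimate_indigenous_faults(seeded: int, seeded_found: int, indigenous_found: int) -> float:
    """Mills estimate of total indigenous faults: N = n * S / s."""
    if seeded_found <= 0:
        raise ValueError("at least one seeded fault must be found")
    if seeded_found > seeded:
        raise ValueError("cannot find more seeded faults than were seeded")
    return indigenous_found * seeded / seeded_found


# Hypothetical experiment: 20 faults seeded, 15 of them found,
# along with 30 indigenous (real) faults.
total = estimate_indigenous_faults(seeded=20, seeded_found=15, indigenous_found=30)
print(f"Estimated indigenous faults: {total:.0f}, remaining: {total - 30:.0f}")
```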


Summary and Future Work
• Overall Framework
• External Environment Profiling
• Component Dependability:
  • Direct Measurement and Assessment
  • Indirect Assessment via Internal Contributor Mapping
  • Value Perspective
• Experimental Evaluation
  • Fault Injection for Reliability and Fault Tolerance
  • Security Threat Simulation
• Summary and Future Work
