FAULT-TOLERANT COMPUTING
Jenn-Wei Lin
Department of Computer Science and Information Engineering
Fu Jen Catholic University
Reliability Modeling and Analysis Lecture Set 3
ECE 753 Fault Tolerant Computing
Overview
• Introduction
• Reliability Modeling
– reliability block diagram
– combinatorial model
– Markov model
• Other Parameters and analysis
• General remarks and Summary
Introduction
• References
– [prad:96], [swew:99], [shooman:02]
– [triv:82]
– Books in the first line (three books) contain sufficient material covering this part of the course
• Recap of definitions
• Importance of analysis and analytical model
• Mathematical formulation for quantitative analysis
Introduction (contd.)
• Recap of definitions
– Reliability R(t)
– Availability A(t)
– Performability and Dependability
• Importance of analysis and analytical model
– to evaluate a design
– a metric to compare different designs
– to provide feedback to the designer during early design stages
– use a model for performance analysis
– used for quantitative and qualitative analysis
Introduction (contd.)
• Mathematical formulation for quantitative analysis
– consider a large experiment with N components
– all operate correctly at time t0
– observation at time t
• N0(t) - number of correctly operating components
• Nf(t) - number of failed components
– Hence
• Reliability R(t) = N0(t)/N = 1 - Nf(t)/N
– Probability that a component has survived the interval [t0, t]
• Unreliability Q(t) = 1 - R(t)
• Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt)
• dNf(t)/dt is called the instantaneous failure rate of the component
Introduction (contd.)
• Mathematical formulation (contd.)
– Also
• failure rate at time t
– (instantaneous failure rate at time t) / N0(t)
– (1/N0(t))(dNf(t)/dt) - called z(t)
– this and the previous expressions together reduce to
» z(t) = -(1/R(t))(dR(t)/dt)
» z(t) is called failure rate function, hazard function or hazard rate
– We can solve the above for R(t) provided we know the failure rate function z(t)
– Bathtub curve for failure rate function
» implies constant failure rate during useful life
» infant mortality and wear out periods have variable failure rates
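Rearranging z(t) = -(1/R(t))(dR(t)/dt) with R(0) = 1 gives R(t) = exp(-∫ z(s) ds), integrated from 0 to t. The sketch below shows that step numerically. It is a minimal illustration, assuming trapezoidal integration; the function name, the step count, and the sample failure rate are hypothetical and not part of the lecture.

```python
import math

def reliability_from_hazard(z, t, steps=10_000):
    """Approximate R(t) = exp(-integral_0^t z(s) ds) with the trapezoidal rule."""
    if t == 0:
        return 1.0
    h = t / steps
    area = 0.5 * (z(0.0) + z(t))      # end points count half
    for k in range(1, steps):
        area += z(k * h)              # interior points count fully
    return math.exp(-area * h)

# Constant hazard, as in the useful-life part of the bathtub curve.
lam = 0.002                           # hypothetical failure rate (per hour)
r_numeric = reliability_from_hazard(lambda s: lam, 500.0)
r_exact = math.exp(-lam * 500.0)      # closed form for a constant rate
```

For a constant z(t) the numeric result should agree with exp(-λt); a time-varying z(t) can be passed in the same way.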
Introduction (contd.)
• Mathematical formulation (contd.)
– Reliability computation - constant failure rate
• dR(t)/dt = -z(t)R(t)
• solve the equation with z(t) = λ - exponential function for reliability and for unreliability, R(t) = 1 - Q(t) = exp(-λt)
– Reliability computation - time varying failure rate
• Weibull distribution z(t) = αλ(λt)^(α-1)
• solve the equation - R(t) = 1 - Q(t) = exp(-(λt)^α), which reduces to the exponential case for α = 1
– Failure rate computation - military standard
• function of - learning factor, quality factor, temperature factor, environmental factor, and # of pins on IC
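The Weibull case can be checked with a few lines of Python. This is a sketch under the parameterization used on the slide, z(t) = αλ(λt)^(α-1), for which solving dR(t)/dt = -z(t)R(t) gives R(t) = exp(-(λt)^α). The function names and sample values are illustrative assumptions.

```python
import math

def weibull_hazard(t, lam, alpha):
    """Failure rate z(t) = alpha * lam * (lam * t) ** (alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha), the solution of dR/dt = -z(t) R(t)."""
    return math.exp(-((lam * t) ** alpha))

lam = 0.01                                       # hypothetical scale parameter
r_constant = weibull_reliability(100.0, lam, 1.0)  # alpha = 1: constant failure rate
r_wearout = weibull_reliability(100.0, lam, 2.0)   # alpha > 1: rate grows with time
```

With α < 1 the failure rate decreases over time (infant mortality); with α > 1 it increases (wear out).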
Introduction (contd.)
• Mathematical formulation (contd.)
– Reliability computation - mean time to failure (MTTF)
• Definition: expected time that a system will operate before the first failure occurs
• Probability measure: S-sample space, E-event space
– for A in E, P(A) >= 0
– P(S) = 1
– P(A∪B) = P(A) + P(B), when A and B are non-intersecting
• Random Variable (RV) - X maps events of S to real-numbers
• Probability distribution function of a RV
• Probability density function (pdf) - derivative of the distribution function
Introduction (contd.)
• Mathematical formulation (contd.)
– Reliability computation - mean time to failure
• Probability density function - properties
– always >= 0
– integrates to 1 (between limits)
• Expectation
– Integrate xf(x)
– Σ xi p(xi) in discrete case
• Application in our case
– unreliability Q(t) is a probability distribution function of failure - in fact it is the cumulative probability that the system fails in time [0,t]
Introduction (contd.)
• Mathematical formulation (contd.)
– Reliability computation - MTTF and MTTR
• Application in our case (contd.)
– derivative of Q(t), written as f(t), is pdf of failure - or failure density function
– Expected value can be computed using integration and is Mean Time To Failure (MTTF)
– constant failure rate
» MTTF = 1/λ
• Mean time to repair - MTTR
– assume constant repair rate (μ) and arguments similar to those used for failure analysis and conclude MTTR = 1/μ
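The MTTF result for a constant failure rate can be reproduced numerically from the expectation integral. The sketch below assumes the failure density f(t) = λ exp(-λt), which is the derivative of Q(t) = 1 - exp(-λt), and truncates the integral at a finite horizon. The helper name, horizon, and rate are illustrative assumptions.

```python
import math

def expected_value(pdf, horizon, steps=200_000):
    """Approximate E[T] = integral_0^horizon t * pdf(t) dt (trapezoidal rule)."""
    h = horizon / steps
    total = 0.5 * (0.0 * pdf(0.0) + horizon * pdf(horizon))
    for k in range(1, steps):
        t = k * h
        total += t * pdf(t)
    return total * h

lam = 0.001                                    # hypothetical failure rate (per hour)

def failure_density(t):
    """f(t) = dQ(t)/dt for Q(t) = 1 - exp(-lam * t)."""
    return lam * math.exp(-lam * t)

# The horizon is 30 mean lifetimes, so the truncated tail is negligible.
mttf_estimate = expected_value(failure_density, horizon=30_000.0)
mttf_closed_form = 1.0 / lam
```

The same helper applied to a repair-time density with rate μ would give MTTR = 1/μ.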
Introduction (contd.)
• Mathematical formulation (contd.)
– Reliability computation - mean time between failures (MTBF)
• Mean time between failures - MTBF
– use heuristic arguments to conclude
» MTBF = (total time T)/(average number of failures)
– can also argue MTBF = MTTF + MTTR
• Note: often λ << μ and hence MTTF >> MTTR, therefore the terms MTTF and MTBF are used interchangeably
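A quick numeric check shows why MTTF and MTBF are nearly equal when λ << μ. This is a sketch assuming constant failure and repair rates; the sample values are hypothetical.

```python
def mtbf(lam, mu):
    """MTBF = MTTF + MTTR for constant failure rate lam and repair rate mu."""
    mttf = 1.0 / lam
    mttr = 1.0 / mu
    return mttf + mttr

lam = 1e-4    # hypothetical: one failure per 10,000 hours on average
mu = 0.5      # hypothetical: a repair takes 2 hours on average
ratio = mtbf(lam, mu) / (1.0 / lam)   # MTBF / MTTF, close to 1 when lam << mu
```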
Reliability Modeling
• Application of the previous analysis to system models
– Assumptions
• system consists of modules
• each module assigned a probability of working R(t), a function of time
• once a module fails it is assumed to yield incorrect results
• module failures are independent
Reliability Modeling
• Application of the previous analysis to system models
– Reliability block diagrams
• consider a system - microP, controller, mem, bus, …
• the system will fail if any of the components fails
• Rsys = P(all subsystems work correctly)
= P(bus correct).P(mem correct). … etc.
(follows from the assumption that component failures are independent)
• Rsys = Rbus.Rmem.Rmicro.Rcont
Reliability Modeling
– Reliability block diagrams - Series Systems
• Assume system has n components
• All components should survive for system to operate
• Reliability of system
– Rsys(t) = Πi Ri(t)
• For exponential distributions of each component
– Rsys(t) = Πi exp(-λi t) = exp(-(λ1 + … + λn)t) = exp(-Σi λi t)
– Effect is that the system failure rate is the summation of failure rates of components
• Note these are nonredundant systems
[Figure: series block diagram of components R1, R2, …, Rn]
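The series formula is easy to check in code. The sketch below computes the product of component reliabilities and, for exponential components, confirms that it matches a single exponential whose rate is the sum of the component rates. Function names and the sample rates are illustrative assumptions.

```python
import math

def series_reliability(reliabilities):
    """R_sys = product of component reliabilities (every component must survive)."""
    r_sys = 1.0
    for r in reliabilities:
        r_sys *= r
    return r_sys

def series_reliability_exponential(failure_rates, t):
    """For exponential components the failure rates add up."""
    return math.exp(-sum(failure_rates) * t)

rates = [1e-4, 2e-4, 5e-5]        # hypothetical failure rates (per hour)
t = 1000.0
product_form = series_reliability([math.exp(-lam * t) for lam in rates])
sum_form = series_reliability_exponential(rates, t)
```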
Reliability Modeling
– Reliability block diagrams - Parallel Systems
• Assume system with spares
• a faulty component is replaced by a spare when a fault occurs
• only one component needs to survive for the system to operate
• Model is to represent all components connected in parallel
• P(sys fail) = P(M1 fails).P(M2 fails). … .P(Mn fails)
• Rsys = 1 - P(sys fail) = 1- (1-R1)(1-R2) …(1-Rn)
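The parallel formula can be written as a small helper. This is a minimal sketch assuming independent module failures, as stated in the modeling assumptions; the function name is hypothetical.

```python
def parallel_reliability(reliabilities):
    """R_sys = 1 - product of (1 - R_i): the system fails only if every module fails."""
    p_all_fail = 1.0
    for r in reliabilities:
        p_all_fail *= (1.0 - r)
    return 1.0 - p_all_fail

two_modules = parallel_reliability([0.9, 0.9])
three_modules = parallel_reliability([0.5, 0.5, 0.5])
```

Adding a module never lowers the result, which is the benefit of redundancy in this model.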
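Reliability Modeling
– Reliability block diagrams - Series-Parallel Systems
• straightforward - apply the series and parallel formulas repeatedly until the diagram collapses to a single block
– Reliability block diagrams - MTTF of system
• 1/(system failure rate), when the system failure rate is constant
• Series systems - 1/(sum of individual failure rates)
• Parallel systems and series parallel systems
– work out by integration from the reliability or unreliability equations, MTTF = ∫ R(t) dt from 0 to ∞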
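For a parallel system the MTTF has to come from integrating the reliability function, MTTF = ∫ R(t) dt from 0 to ∞. The sketch below does this numerically for two identical exponential modules in parallel, where the closed form is 3/(2λ). The helper name, truncation horizon, and failure rate are illustrative assumptions.

```python
import math

def integrate_reliability(reliability, horizon, steps=200_000):
    """Approximate MTTF = integral_0^horizon R(t) dt (trapezoidal rule)."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

lam = 0.001                               # hypothetical failure rate (per hour)

def two_parallel(t):
    """R(t) = 1 - (1 - exp(-lam * t))**2 for two identical parallel modules."""
    r = math.exp(-lam * t)
    return 1.0 - (1.0 - r) ** 2

mttf_parallel = integrate_reliability(two_parallel, horizon=30_000.0)
mttf_closed_form = 3.0 / (2.0 * lam)      # 1/lam + 1/(2*lam)
```

Note that the second module adds only half a mean lifetime, not a full one.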
Reliability Modeling
– Reliability block diagrams - Non series parallel systems
• Bayes rule (theorem of total probability): consider a sample space S. Partition it into a space B and B' (the complement of B). Now consider an event A that falls partly in B and partly in B'. We can write:
A = (A∩B) ∪ (A∩B')
P(A) = P[(A∩B) ∪ (A∩B')]
= P(A∩B) + P(A∩B')
= P(A|B)P(B) + P(A|B')P(B')
• In general the set S can be partitioned into (B1, B2, … ,Bn)
P(A) = Σi P(A|Bi)P(Bi)
This can be viewed graphically also (draw a tree)
Reliability Modeling
• Reliability block diagrams - Non series parallel systems
– Example - consider the following non series parallel system
– list all paths for system to survive, namely c1c4, c2c4, c2c5, c3c5
– These paths are not disjoint; the sum of the reliabilities of all paths gives an upper bound on the system reliability
– Exact computation is possible using Bayes rule – complete in class
[Figure: non series parallel block diagram with components C1, C2, C3, C4, C5]
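One way to carry out the exact computation for this example is sketched below. It uses only the four success paths listed above. Conditioning on C2 is an assumption made here for illustration (C2 is the component shared by two paths); the slides leave the derivation for class. The brute-force enumeration serves as a cross-check, and all names and the sample reliability are hypothetical.

```python
from itertools import product

NAMES = ["c1", "c2", "c3", "c4", "c5"]
PATHS = [("c1", "c4"), ("c2", "c4"), ("c2", "c5"), ("c3", "c5")]

def exact_by_enumeration(r):
    """Sum the probabilities of all 2^5 component states in which some path works."""
    total = 0.0
    for states in product([True, False], repeat=len(NAMES)):
        up = dict(zip(NAMES, states))
        if any(all(up[c] for c in path) for path in PATHS):
            prob = 1.0
            for name in NAMES:
                prob *= r[name] if up[name] else 1.0 - r[name]
            total += prob
    return total

def exact_by_conditioning(r):
    """P(works) = P(works | c2 up) P(c2 up) + P(works | c2 down) P(c2 down)."""
    # c2 up: either c4 or c5 completes a path.
    given_up = 1.0 - (1.0 - r["c4"]) * (1.0 - r["c5"])
    # c2 down: only the paths c1c4 and c3c5 remain, and they share no component.
    given_down = 1.0 - (1.0 - r["c1"] * r["c4"]) * (1.0 - r["c3"] * r["c5"])
    return r["c2"] * given_up + (1.0 - r["c2"]) * given_down

def path_sum_upper_bound(r):
    """Sum of path reliabilities; overcounts because the paths overlap."""
    return sum(r[a] * r[b] for a, b in PATHS)

r = {name: 0.9 for name in NAMES}   # hypothetical component reliability
exact = exact_by_conditioning(r)
bound = path_sum_upper_bound(r)
```

With highly reliable components the path-sum bound exceeds 1, so it is only informative when path reliabilities are small.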
Reliability Modeling
– Combinatorial model
• Consider an NMR system
• Assume voter reliability to be 1
• Divide the success events into disjoint events
• Compute probability of each event and add them
• Example – TMR system
• Can be used to compute MTTF
• Can also analyze other systems such as an m-of-n system
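The combinatorial model can be sketched for an m-of-n system, with TMR as the 2-of-3 case. The code assumes identical, independent modules with reliability R and a perfect voter; the disjoint success events are "exactly k modules work" for k = m, …, n. Function names are illustrative.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """P(at least m of n modules work) = sum over k of C(n,k) r^k (1-r)^(n-k)."""
    return sum(comb(n, k) * r**k * (1.0 - r) ** (n - k) for k in range(m, n + 1))

def tmr_reliability(r):
    """TMR is 2-of-3: R_TMR = 3 r^2 (1 - r) + r^3 = 3 r^2 - 2 r^3."""
    return m_of_n_reliability(2, 3, r)

r_module = 0.9                       # hypothetical module reliability
r_tmr = tmr_reliability(r_module)
```

TMR improves on a single module only when R > 0.5; at R = 0.5 the two are equal.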
Reliability Modeling
– Markov model
• Difficulty with the previous models
– incorporating repairs in the model and analysis
– incorporating a coverage factor - e.g., in a duplicated system we may be less than 100% certain that only the faulty unit is eliminated when the system is reconfigured
• Markov modeling - basic
– Define the concept of state using TMR system example (8 states)
– Transitions between states occur with certain probabilities
• Markov model - assumption
– Probability of transition from a state si to sj is independent of the method of arrival into state si
• Example - develop a Markov model for a TMR in class
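Reliability Modeling
– Markov model
• Markov model for a TMR – all details not shown
[Figure: Markov state diagram for a TMR system with states 111, 110, 101, 011, 100, 010, 001, 000; transitions out of state 111 are labeled λΔt, and the self-loop on 111 is labeled 1-3λΔt]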
Reliability Modeling
– Markov model - Reduced
• Reduced Markov model for a TMR system
• Previous eight state model can be reduced to a three state model by merging states and re-computing the transition probabilities
– Markov model - accounting for repairs
• We can include links between states knowing the repair rates of components
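A reduced three-state TMR model can be iterated directly as difference equations. The sketch below assumes the merged states are "three modules working", "two modules working", and "failed", with no repair and a perfect voter; this grouping, the time step, and the sample rate are assumptions for illustration. The closed form 3exp(-2λt) - 2exp(-3λt) follows from the combinatorial TMR result with R = exp(-λt) and is used as a check.

```python
import math

def tmr_markov_reliability(lam, t, dt=1e-3):
    """Iterate the three-state TMR difference equations up to time t."""
    p3, p2, pf = 1.0, 0.0, 0.0              # start with all three modules working
    for _ in range(int(round(t / dt))):
        p3, p2, pf = (
            p3 * (1.0 - 3.0 * lam * dt),                     # stay in state 3
            p3 * 3.0 * lam * dt + p2 * (1.0 - 2.0 * lam * dt),  # enter or stay in state 2
            p2 * 2.0 * lam * dt + pf,                        # failed state is absorbing
        )
    return p3 + p2                          # system works with 3 or 2 good modules

lam, t = 0.01, 50.0                         # hypothetical rate and mission time
numeric = tmr_markov_reliability(lam, t)
closed_form = 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)
```

Shrinking dt brings the iterated result closer to the closed form, which is the limit that leads to the differential equation model.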
Reliability Modeling
– Markov model - analyzing systems
• Consider a duplicate compare system – no repairs
• Develop Markov model with 3 states
• Develop a difference equation for computing probabilities for being in different states of the system
• Develop a differential equation model
• Solution methods
– Numerical approach
– Solving differential equation
» direct approach
» Using Laplace transforms
Reliability Modeling
– Markov model - analyzing systems
• Consider a duplicate compare system – with repairs
• Develop Markov model with 3 states
• Develop a differential equation model
• Solve using Laplace transforms
– Yet one more example
• duplicate compare system – with imperfect coverage
• Develop Markov model with 5 states
• Reduce model for different scenarios
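For the duplicate system with repairs, the numerical approach can be sketched as follows. The three-state model used here is an assumption, since the slides develop the model in class: state 2 (both modules working), state 1 (one working, the other under repair), and an absorbing failed state, with failure rate λ per module, repair rate μ, and perfect coverage. The equations are integrated with a simple Euler step instead of Laplace transforms, and the MTTF expression comes from solving the expected-time equations of this assumed model.

```python
import math

def duplex_with_repair_reliability(lam, mu, t, dt=1e-3):
    """Euler integration of the assumed three-state model:
        dP2/dt = -2*lam*P2 + mu*P1
        dP1/dt =  2*lam*P2 - (lam + mu)*P1
    R(t) = P2(t) + P1(t); the failed state is absorbing."""
    p2, p1 = 1.0, 0.0
    for _ in range(int(round(t / dt))):
        d2 = -2.0 * lam * p2 + mu * p1
        d1 = 2.0 * lam * p2 - (lam + mu) * p1
        p2, p1 = p2 + d2 * dt, p1 + d1 * dt
    return p2 + p1

def duplex_with_repair_mttf(lam, mu):
    """MTTF of the assumed model: (3*lam + mu) / (2*lam**2)."""
    return (3.0 * lam + mu) / (2.0 * lam ** 2)

lam, t = 0.01, 50.0                                    # hypothetical values
no_repair = duplex_with_repair_reliability(lam, 0.0, t)
with_repair = duplex_with_repair_reliability(lam, 1.0, t)
```

Setting μ = 0 recovers the no-repair parallel result 2exp(-λt) - exp(-2λt), which gives a sanity check on the model.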
Summary
• Introduction of mathematical models
• Solving models to carry out analysis
– Example systems
• Duplicate
• Duplicate with repair
• Simplex with repair for availability