Expert-based maintenance: a study of its effectiveness

IEEE TRANSACTIONS ON RELIABILITY, VOL. 47, NO. 1, 1998 MARCH 53

Expert-Based Maintenance: A Study of Its Effectiveness

P.K. Chande, Senior Member IEEE S.G.S. Institute of Technology and Science, Indore

S.V. Tokekar, Member IEEE Devi Ahilya University, Indore

Key Words - Expert system, Computer maintenance, Performance modeling.

Abstract - Monitoring of computer-based systems by a supervisory computer is common for high-availability systems. Expert-based supervisory systems are being proposed which are able to use dynamic information of the system to operate them with increased reliability. This paper brings out the functional capabilities of expert-based maintenance, and presents an analytic model to evaluate the effectiveness of the expert system in maintenance. The abilities of the expert system to maintain the host are parameterized and their effects on the performance of the system are studied. The results show possible improvement in the performance of a host due to expert-based maintenance.

1. INTRODUCTION

In real-time industrial systems, a failure can damage the economic operation of a plant. Therefore, it is preferred to keep the plant operating continuously, even in a degraded state, and to fix it subsequently. In this regard, research & development of fault-tolerant systems shows that a dedicated processor, known as a maintenance processor, can be used to monitor, control, and maintain the operation of a host system [1]. A maintenance processor can be a free-standing machine, an individual frame, or a module. To reduce simultaneous failures of the host and the maintenance processor, the two are made functionally autonomous. However, the firmware of the maintenance processor cooperates with the operating system of the host for effective maintenance. Since the maintenance processor monitors the host system concurrently, it can detect and then handle any abnormal condition. It also collects the status information and displays it on the local console, with which the operator can perform recovery actions. Thus the maintenance processor [1] improves reliability, availability, and serviceability of the total system. On-line function analyzers are also useful in this regard [2].

Expert-based maintenance methodologies are proposed & reported in [3 - 5]. They have the potential to improve reliability of systems, besides the conventional monitoring functions. The expert-based maintenance systems can, eg:
• perform the functions to delay the system failure,
• operate the system effectively in a degraded mode,
• envisage the possibility of faults and their recovery,
• help faster recovery after occurrence of faults.

From other studies and [6], the degradation of the systems due to software failures is low compared to hardware failures. This suggests that software measures, like expert-based maintenance of the host, are helpful in improving the system reliability. This paper therefore studies the effectiveness of expert-based maintenance for real-time systems.

2. SYSTEM ARCHITECTURE

Embedded systems typically have two component types: hardware & software. Different analytic methods have been proposed to evaluate hardware reliability [7 - 9] and software reliability [10 - 11] of computer-based systems. Reliability estimation of the whole system, with practical interest, is reported in [6]. However, the software component in this system must perform control operations only, and [6] does not emphasize the maintenance support of the system. Ref [1] uses a maintenance processor along with a host system and presents the reliability study of the integrated computer system.

The analytic models in the previous paragraph represent the mechanism of operation of the systems based on certain operating assumptions; the operation of a system is then described in terms of operating states and the transitions between them. Similarly, understanding the functionality of the system with expert-based maintenance is required to perform appropriate reliability studies. Therefore the expert-based maintenance of a system can be considered in the category of real-time knowledge-based systems [12]. Various real-time knowledge-based systems for maintenance, etc are reported in the literature. Ref [13] discusses in detail the functional structure of the real-time knowledge-based system.

This paper distinguishes the real-time control processes from the knowledge-based process. That is, the system has two modules: 1) the host, and 2) the autonomous expert maintenance system; this is important in creating the reliability model here. The host could be a general purpose system which performs the designated tasks while the expert system is functionally autonomous and maintains the host.

0018-9529/98/$10.00 © 1998 IEEE


Typically, the expert system can perform tasks like:
• fault anticipation,
• fault recovery,
• degraded-system operation,
• recovery on total host failure.

2.1 Fault Anticipation

The expert system can be used to anticipate two classes of faults: critical & non-critical. (A critical fault is a fault which cannot be repaired transparently to the operation of the system, without degradation.) These faults can be transparent [14] or non-transparent. (A transparent fault is a fault, as defined in [14], which cannot be directly detected by a sensor but can be detected by the expert system from the dynamics of the host.) The logic for the anticipation of faults depends on the individual application. The following typical rules illustrate it:

1. Fault Anticipation:
a. Gain(exchanger)-value LOW ⇒ Anticipated-fault(stuck-exchanger-valve).
b. Gt(valve-1-close-signal-count, 3) ⇒ Failed(solenoid-1).

2. Transparent Fault-Detection:
Parameter-1-value LOW ∧ Parameter-2-value LOW ∧ Parameter-3-value LARGE ⇒ Defective(module-2).
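The rules above can be read as simple condition-action pairs. The following is a minimal sketch of that reading, not the authors' implementation: the observation keys, the string labels, and the interpretation of Gt(x, 3) as "x greater than 3" are assumptions made for illustration.

```python
# Hypothetical encoding of the illustrative rules in section 2.1.
# Observation keys and conclusion labels are assumptions, not from the paper.

def anticipate_faults(obs):
    """Return the conclusions drawn from one snapshot of host observations."""
    conclusions = []
    # Rule 1a: a low exchanger gain suggests a stuck exchanger valve.
    if obs.get("exchanger_gain") == "LOW":
        conclusions.append("anticipated-fault(stuck-exchanger-valve)")
    # Rule 1b: more than 3 close signals for valve 1 suggests solenoid 1 failed
    # (assumes Gt means "greater than").
    if obs.get("valve_1_close_signal_count", 0) > 3:
        conclusions.append("failed(solenoid-1)")
    # Rule 2: a transparent fault is inferred from a combination of parameters,
    # none of which is a direct fault sensor.
    if (obs.get("parameter_1") == "LOW"
            and obs.get("parameter_2") == "LOW"
            and obs.get("parameter_3") == "LARGE"):
        conclusions.append("defective(module-2)")
    return conclusions
```

A real rule base would be application specific, as the text notes; the point of the sketch is only that anticipation is driven by the observed dynamics of the host.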

2.2 Fault Recovery

The expert system can be designed & used for the recovery of the host from a failure after the fault has occurred. This enables faster recovery of the host. However, in a more effective approach, the expert system can be used to avoid a failure by anticipating the fault and repairing it transparently without any host degradation - as proposed in this paper. This is possible if the expert system is sufficiently intelligent. That is, the expert system can detect unusual dynamics of the process, eg,
• deviation from its usual operating conditions,
• repeated attempts to accomplish a function,
• delay in operations,
• diminishing performance,
• identify idle cycles of modules.

If the host functionality is such that the anticipated fault (transparent or non-transparent [14]) can be repaired transparently to the overall operation of the host, then the repair of faults can be achieved without any host degradation for maintenance. If the host functionality does not permit transparent repairs, then the host has to be degraded while repairing the faults. This paper considers a fault as non-critical if it can be repaired transparently. The following typical rules illustrate it:

1. Fault recovery:
fault-sequence(f1, f2, f5, f7) ⇒ repair-sequence(f2, f1, f7, f5).
The f1, ..., f7 are the fault identifiers.

2. Transparent-fault repair:
Idle(module-3) ∧ operational(module-8) ⇒ replace(exchanger-valve).
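The two recovery rules can be sketched as a lookup and a guard condition. This is only an illustration under stated assumptions: the table holds just the one sequence given in the text, and the fallback of repairing in order of occurrence is a hypothetical choice that the paper does not specify.

```python
# Hypothetical repair-ordering table; only this one entry comes from the text.
REPAIR_SEQUENCES = {
    ("f1", "f2", "f5", "f7"): ("f2", "f1", "f7", "f5"),
}

def plan_repairs(fault_sequence):
    """Return the repair order for an observed fault sequence.

    Falls back to the order of occurrence when no rule matches
    (an assumption made for this sketch).
    """
    key = tuple(fault_sequence)
    return REPAIR_SEQUENCES.get(key, key)

def can_replace_exchanger_valve(idle_modules, operational_modules):
    """Transparent-repair guard: module-3 is idle and module-8 is operational."""
    return "module-3" in idle_modules and "module-8" in operational_modules
```

The guard reflects the idea that a transparent repair is scheduled only when the affected module is idle, so the host does not have to be degraded for maintenance.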

2.3 Degraded-System Operation

The expert system can be designed to help operate a host in a degraded mode so that the total shutdown of the host on the occurrence of a fault(s) can be avoided [3 - 4]. In such a design, the expert system needs to have an overall understanding of the process to operate it in a safe manner. For example, if two pumps are feeding a reservoir, one can be shut off on occurrence of a fault, and the other can still continue. The process in this condition can be appropriately degraded to a lower performance. A typical rule-base contains rules like:
1. Pump-1 ON ∧ Pump-2 ON ⇒ Open Out-valve MAXIMUM.
2. Pump-1 ON ∧ Pump-2 OFF ⇒ Open Out-valve MEDIUM.
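As a minimal sketch, the two pump rules map directly onto a small function. The function name and the return value for combinations not covered by the rules are assumptions; the text gives no rule for those cases.

```python
def out_valve_setting(pump_1_on, pump_2_on):
    """Return the out-valve opening implied by the two example rules."""
    if pump_1_on and pump_2_on:
        return "MAXIMUM"   # rule 1: both pumps feeding the reservoir
    if pump_1_on and not pump_2_on:
        return "MEDIUM"    # rule 2: degraded operation on one pump
    return None            # no rule is given in the text for other cases
```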

Thus, expert-based maintenance with proper redundancy in the host system reduces the malignant effect of the faults. In the degraded mode the expert can support the host in its operations but cannot provide any further protection against failure - because no spare host capacity is available. Any further fault damages the host.

2.4 Recovery on Total Host-Failure

Since the expert system can have complete knowledge of how, when, and why the host has failed, that knowledge can be used to help recover the host faster. This virtue can reduce the downtime of the system [la]. In general, a host recovery is similar to a fault recovery procedure (see section 2.2).

3. THE MARKOV CHAIN MODEL

Notation

ANC   Pr{anticipation of non-critical fault}
RNC   Pr{repair of non-critical fault}
AC   Pr{anticipation of critical fault}
RC   Pr{repair of critical fault}
$S_i$   state $i$
$p_i$   steady-state probability of $S_i$
$\lambda_{g,d}$   failure rate from good to degraded
$\lambda_{g,c}$   failure rate from good to critical
$\lambda_{d,c}$   failure rate from degraded to critical
$\lambda_{i,j}$   failure rate from $S_i$ to $S_j$ in figure 1
$\lambda_{e,f}$   expert failure rate
$\mu_{e,r}$   expert repair rate
$\mu_{d,g}$   repair rate from degraded to good
$\mu_{c,g}$   repair rate from critical to good
$\mu_{c,d}$   repair rate from critical to degraded
$\mu_{r,d}$   anticipatory repair rate of degraded faults
$\mu_{r,c}$   anticipatory repair rate of critical faults
$A_t$   availability of total system
$A_g$   availability of good (undegraded) system
$\mathrm{MUT}_t$   mean up-time of total system
$\mathrm{MUT}_g$   mean up-time of good (undegraded) system

Other, standard notation is given in “Information for Readers & Authors” at the rear of each issue.


Figure 1: Markov Chain Model for Expert-Based Maintenance

Assumed Values (for this model)

$\lambda_{g,d}$ = 1/120 failures/hour
$\lambda_{g,c}$ = 1/480 failures/hour
$\lambda_{d,c}$ = 1/240 failures/hour
$\lambda_{e,f}$ = 1/1000 failures/hour
$\mu_{e,r}$ = 1/12 repairs/hour
$\mu_{d,g}$ = 1/12 repairs/hour
$\mu_{c,g}$ = 1/48 repairs/hour
$\mu_{c,d}$ = 1/24 repairs/hour
$\mu_{r,d}$ = 1/6 repairs/hour
$\mu_{r,c}$ = 1/12 repairs/hour

Assumptions

1. The system is a Markov chain.
2. The Markov chain is that in figure 1.
3. Failure & repair rates are parameterized by the constant probabilities ANC, AC, RNC, RC to depict various levels of capabilities of the expert.

The Markov chain model represents the system operation with 8 different states. These states are described here, and are shown in figure 1.

In $S_1$, the expert system and the host are both functional. The expert system, along with its usual functions, performs the task of fault anticipation. If a non-critical fault, which is repairable (see section 2.1) transparently to the system operation, is anticipated by the expert system, then the expert system repairs it. This is represented by a transition from $S_1$ to $S_2$.

ANC is a factor which reflects the capability of the expert system to anticipate a fault. The value of ANC (similar to the coverage factor in conventional reliability modeling) is the ratio of the number of faults the expert system can anticipate to the total number of faults that can occur. If such a fault cannot be anticipated, then the fault actually occurs in due course of time. In that case the system transits from $S_1$ to $S_4$.
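To make the coverage-factor analogy concrete, the sketch below computes ANC as a ratio and splits a failure rate into an anticipated and an unanticipated part. The split is the conventional coverage-factor treatment and is an assumption here: the exact way ANC enters the transition rates is defined by figure 1, which is not reproduced in this text. The function names and the fault counts in the usage note are hypothetical.

```python
from fractions import Fraction

def coverage(anticipated_faults, total_faults):
    """ANC as the ratio of anticipated faults to all faults that can occur."""
    return Fraction(anticipated_faults, total_faults)

def split_rate(rate, anc):
    """Split a failure rate into (anticipated, unanticipated) flows.

    Assumes the conventional coverage-style split: a fraction anc of the
    faults is anticipated, and the remaining 1 - anc actually occurs.
    """
    return anc * rate, (1 - anc) * rate
```

For example, with a hypothetical 19 anticipated faults out of 20 and the assumed good-to-degraded rate of 1/120 per hour, `split_rate(Fraction(1, 120), coverage(19, 20))` gives 19/2400 and 1/2400 per hour.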

In $S_4$ the system is functional with degraded performance. The expert system tries to recover the system and to manage the degraded performance. If the fault anticipated by the expert system in $S_1$ is of the critical type, then the system attains $S_5$. In $S_5$, the expert system again manages the host in degraded mode and tries to repair the fault.

If the critical fault occurs, and escapes the anticipation by the expert system ($S_1$) or could not be repaired in $S_5$, then the system transits to $S_7$, which is a failed state.

$S_6$ is similar to $S_4$, $S_8$ is similar to $S_7$, and $S_3$ is similar to $S_2$ - as far as the operation of the host is concerned, except for the fact that the expert is down in $S_3$, $S_6$, $S_8$. This means that:
• In $S_3$, the host is working but no expert-system support is available for maintenance, with respect to anticipation of faults.
• In $S_6$, the host is degraded, because no expert-system support is available to the host for further repairs.
• In $S_8$, both the host and the expert system are down and thus the system is failed.

In the model (figure 1) the recovery of the expert system is performed first. This is of particular interest as the expert system is always helpful in the recovery of the host.

Thus, classify the operations of the system into:
• Good ≡ $S_1 + S_2 + S_3$,
• Degraded ≡ $S_4 + S_5 + S_6$,
• Down ≡ $S_7 + S_8$.

System availability and mean up-time are then found by using the appropriate $p_i$ [15]:

$$A_t = \sum_{i=1}^{6} p_i,$$

$$A_g = \sum_{i=1}^{3} p_i.$$
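The availability sums can be illustrated numerically with a small steady-state solver. The sketch below is not the 8-state chain of figure 1, whose transitions are not reproduced in this text. It is a deliberately reduced, hypothetical 3-state chain (good, degraded, down) with no expert system, built only from the assumed base rates, so that the procedure of solving for the $p_i$ and summing over the up states is visible. The function name and the state numbering are assumptions.

```python
def steady_state(rates, n):
    """Solve p Q = 0 with sum(p) = 1 for a continuous-time Markov chain.

    rates maps (i, j) to the transition rate from state i to state j.
    """
    # Generator matrix Q: off-diagonal rates, diagonal = minus the row sum.
    q = [[0.0] * n for _ in range(n)]
    for (i, j), r in rates.items():
        q[i][j] += r
        q[i][i] -= r
    # Balance equations use the transpose of Q; the last one is replaced
    # by the normalization condition sum(p) = 1.
    a = [[q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    a[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    p = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(a[r][c] * p[c] for c in range(r + 1, n))
        p[r] = s / a[r][r]
    return p

# Reduced chain (assumption): 0 = good, 1 = degraded, 2 = down.
RATES = {
    (0, 1): 1 / 120,  # good -> degraded
    (0, 2): 1 / 480,  # good -> critical (down)
    (1, 2): 1 / 240,  # degraded -> critical (down)
    (1, 0): 1 / 12,   # degraded -> good
    (2, 0): 1 / 48,   # critical -> good
    (2, 1): 1 / 24,   # critical -> degraded
}

p = steady_state(RATES, 3)
a_good = p[0]          # availability of the good system
a_total = p[0] + p[1]  # availability of the total system (good + degraded)
```

For this reduced chain the sketch gives a good-system availability of 122/141 (about 0.865) and a total availability of 136/141 (about 0.965). These are properties of the simplified chain only, not results from the paper; in the full model the same sums run over $S_1$ to $S_3$ and $S_1$ to $S_6$.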

Section 4 studies the effects of variation of ANC, AC, RNC, RC of the expert system on the performance of the host.

4. RESULTS

We have studied the system to see the effects of critical and non-critical faults with variation in capabilities of the expert-maintenance system. Total availability of the system (good and degraded) as well as availability of a good system have been observed:

1. (figure 2) by disabling the expert system from anticipating critical faults (AC = RC = 0), while enabling it to anticipate non-critical faults, RNC = 0.95, ANC = 0 to 0.95;

2. (figure 3) vice-versa: ANC = RNC = 0, RC = 0.95, AC = 0.0 to 0.95.

Figure 2 shows that the total system availability remains unchanged whether or not the expert system takes care of non-critical faults which are repairable transparently to the system operation. Therefore, for a system which can run in the good as well as the degraded mode, the expert system need not be designed to take care of ‘non-critical transparent-repairable faults’. However, in such a situation, since the system is required to operate also in the degraded mode, the expert system should be able to manage the host in the degraded mode. If the requirement is such that the system can degrade but is mostly preferred in the good state, then the expert system should also effectively attempt to anticipate the non-critical faults, and to repair them transparently.

Figure 2: Availability vs ANC

Figure 3: Availability vs AC (ANC = RNC = 0, RC = 0.95)

Figure 3 shows that, to have better total availability (good + degraded), anticipation and repair of critical faults (AC & RC) should have higher values. Thus, as far as the total availability of the system is concerned, more emphasis can be given to the expert system in repairing critical faults, and managing the host operation in degraded mode.

Figure 4 plots the system availability vs ANC, keeping anticipation and recovery of critical faults at their maximum (AC = RC = 0.95). The nature of the availability curves for good & total remains the same, but with increased availabilities, as compared to figure 2 where AC = RC = 0.

Figures 5 & 6 show the mean up-time of the system. Figure 5 plots the MUT of the good system vs anticipation of non-critical faults, keeping other parameters constant. The MUT of the good system increases exponentially with increase in ANC. Therefore, if the system is required to operate on its own, for longer time without degradation, the anticipation of non-critical faults, and their transparent recovery by the expert system can be beneficial.

Figure 6 shows that the MUT of the total system (which also includes the system in degraded mode) remains constant irrespective of the values of ANC. Thus if the MUT of the total system is of concern, then the anticipation and transparent recovery of non-critical faults are irrelevant.

Figure 4: Availability vs ANC

Figure 5: $\mathrm{MUT}_g$ vs ANC

Figure 6: $\mathrm{MUT}_t$ vs ANC (AC = RC = 0, RNC = 0.95)

Figure 7: $\mathrm{MUT}_t$ vs AC

Figure 8: $\Delta\mathrm{MUT}_t/\Delta\mathrm{AC}$ vs AC (ANC = RNC = 0, RC = 0.95)

Figure 7 plots the MUT of the total system vs AC. The MUT of the total system increases exponentially. Therefore, anticipation of critical faults and their non-transparent repair are beneficial to keep the MUT of the total system high.

Figure 8 shows $\Delta\mathrm{MUT}_t/\Delta\mathrm{AC}$ as a function of AC. The $\Delta$ implies an incremental change in $\mathrm{MUT}_t$ or in AC.
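The incremental ratio can be computed from a sampled curve by a simple finite difference. The sketch below assumes the mean up-time has already been evaluated at a list of increasing AC values; the function name is hypothetical, and the numbers in the usage note are placeholders, not values read from the figures.

```python
def incremental_ratio(ac_values, mut_values):
    """Return the ratio (change in MUT) / (change in AC) between samples."""
    pairs = list(zip(ac_values, mut_values))
    return [(m1 - m0) / (a1 - a0)
            for (a0, m0), (a1, m1) in zip(pairs, pairs[1:])]
```

For instance, placeholder samples `incremental_ratio([0.0, 0.5, 1.0], [100.0, 110.0, 130.0])` give `[20.0, 40.0]`; a ratio that grows with AC is what a curve rising faster than linearly would produce.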

REFERENCES

[1] T.S. Liu, “The role of a maintenance processor for a general purpose computer system”, IEEE Trans. Computers, vol C-33, 1984 Jun, pp 507-517.

[2] Gadi Kaplan, “Nuclear power plant malfunction analysis”, IEEE Spectrum, 1983 Jun, pp 53-58.

[3] N. Sormail, X. Tang, P. Millot, D. Willaeys, “An expert system for process control coping with dynamic information”, Proc. IECON’87, 1987 Nov, pp 822-827.

[4] B. D’Ambrosio, et al, “Real time process management for materials composition in chemical manufacturing”, IEEE Expert, 1987 Summer, pp 80-81.

[5] R. Dube, A State Transition Model for Rule-Based Expert Systems, PhD Thesis, 1989 Oct; Rutgers University.

[6] G.E. Stark, “Dependability evaluation of integrated hardware/software systems”, IEEE Trans. Reliability, vol R-36, 1987 Oct, pp 440-444.

[7] D.K. Lloyd, M. Lipow, Reliability Management, Methods and Mathematics (2nd ed), 1984; ASQC.

[8] E.J. Henley, H. Kumamoto, Reliability Engineering and Risk Assessment, 1981; Prentice Hall.

[9] S. Dhillon, System Reliability, Maintainability and Management, 1983; Petrocelli Books.

[10] B. Littlewood, “Theories of software reliability: How good are they and how can they be improved?”, IEEE Trans. Software Engineering, vol SE-6, 1980 Sep, pp 489-500.

[11] P.A. Keiller, B. Littlewood, D.R. Miller, A. Safer, “On the quality of software reliability prediction”, Electronic Systems Effectiveness and Life Cycle Costing, (J. K. Skwirzynski, Ed), NATO ASI Series, vol F3, 1983; Springer-Verlag.

[12] S.S. Lamba, Y.P. Singh, Distributed Computer Control Systems, 1992; Tata McGraw Hill.

[13] H. Voss, “Architectural issues for expert systems in real-time control”, Proc. IFAC Workshop, 1988 Sep, pp 1-6; Swansea, UK.

[14] P.K. Chande, S. Kher, M. Shrivastava, “Intelligent diagnosis of transparent faults in process control”, Proc. IECON92, pp 868-870; 1992 Nov 9-13, San Diego.

[15] K.S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1988; Prentice-Hall of India.

AUTHORS

Dr. P.K. Chande; D-36 HIG Colony; Indore - 452 008 INDIA. Internet (e-mail): gsits@bom4.vsnl.net.in

P.K. Chande (M’82, SM’94) is a Professor and Head of the Dep’t of Computer Eng’g at S.G.S. Institute of Technology & Science, Indore. He is responsible for the development of the Centre for Robotics & AI at this Institute and is the PI-Coordinator for the World Bank IMPACT Project. He has pursued research in multiprocessors, fault-tolerance, real-time systems, computer architecture, automation, and artificial intelligence (AI). He has over 90 research papers to his credit, and has appreciably contributed to research, profession, and administrative activities for which he has received various awards. He received the 1992 SICE International Award, the 1992 Dr. Radhakreshnam Award, the 1994 Lajja Shankar Jha Award, and the 1996 Anna University National Award. He was a visiting professor at Kumamoto University, Japan, during 1996.

Dr. S.V. Tokekar; Inst. of Computer Science, Electronics, and Instrumentation; Devi Ahilya Univ; Khandwa Road; Indore - 452 001 INDIA. Internet (e-mail): icse@bom2.vsnl.net.in

S.V. Tokekar (M’90) received the BE (1982) in Electronics & Telecommunication, the ME (1985) in Applied Electronics and PhD (1996) in Electronics from the S.G.S. Institute of Technology & Science, Indore. In his early career he served in electronics industries. He has been a Reader at the Institute of Computer Science & Electronics, Devi Ahilya University since 1990. He performs research in computer architecture, digital signal processing, computer networks, and microprocessors.

Manuscript TR94-042 received 1994 April 6; revised 1996 March 18, 1998 February 5.

Responsible editor: H.C. Benski

Publisher Item Identifier S 0018-9529(98)05637-1
