EXPANDING ROOT CAUSE ANALYSIS TO INCLUDE ORGANIZATIONAL FACTORS AND WORK PROCESSES

by

R.W. Tuli, J-S. Wu, G.E. Apostolakis*

School of Engineering and Applied Science
38-137 Engineering IV
University of California
Los Angeles, CA 90095-1597
USA

tel: (310) 825-1300
fax: (310) 206-2302

Presented at the American Nuclear Society International Topical Meeting on Safety Culture in Nuclear Installations, Vienna, 24-28 April 1995

* To whom correspondence should be addressed



R.N. 940608


ABSTRACT

All nuclear power plants incorporate root cause analysis to help identify and isolate key factors judged significant following an incident. Identifying the principal deficiencies can become very difficult when the incident involves not only human and machine interaction but possibly the underlying safety culture of the organization. The current state of root cause analysis in many plants is to stop after identifying human or hardware failures. In this work, root cause analysis is taken one step further by examining work processes and organizational factors, especially when management deficiency or human failure contributes to the incident. Root cause analysis is best designed when the organization, as a whole, wishes to improve the overall operation of the plant by preventing similar incidents from occurring again in the future. By focusing on the possible solutions, as well as the fault, the organization can begin to address problems hidden deeply within the work processes that operate, maintain, and support the plant.


1. Introduction

Root cause analysis is a methodology used by nuclear power plants to help identify and isolate the key contributing factors judged significant leading up to and during an incident. When the occurrence involves many factors, including human performance and/or management decisions, identifying the root cause of the event may become very difficult and, in many cases, may involve the underlying safety culture of the organization. Traditional methods of root cause analysis focus primarily on material deficiency and human error but stop short of looking deeper into the many work processes and organizational factors affecting everyday operation and support of the plant. In this work, a methodology is suggested to systematically expand traditional approaches to root cause analysis to incorporate the evaluation of organizational factors and work processes. Probing deeper into the event in this way allows corrective actions to focus not only on the cause but also on improving the safety culture of the organization.

2. Overview of Current Root Cause Analysis

The methodologies suggested by organizations such as the International Atomic Energy Agency (IAEA), which runs the Assessment of Safety Significant Event Team (ASSET), and the Nuclear Regulatory Commission (NRC), which developed the Human Performance Investigation Process (HPIP) [US Nuclear Regulatory Commission, 1993], are designed to address latent weaknesses in the Nuclear Power Plant (NPP) which have resulted in an incident or accident. Root cause analysis, in most cases, investigates why these weaknesses were not eliminated in a timely manner.

To expand root cause analysis to include work processes and organizational factors, this work uses as a case study the application of the ASSET methodology to a selected incident. ASSET analyzes significant events by preparing a descriptive narrative, establishing a chronological sequence of events, and preparing the logic tree of occurrences which lead to the event. Significant occurrences in the logic tree are then investigated in detail and summarized in an Event Root Cause Analysis Form (ERCAF) [Reisch, 1994].

This work demonstrates that by expanding the ERCAF to include one additional section, i.e. one specifically addressing possible latent weaknesses in the key work process(es), we can greatly improve the corrective actions designed by the organization, so that they include not only preventive measures but also improvements to the overall safety culture of the organization through improvement of that work process.

The ERCAF is divided into three sections (Table 1). The first section describes the incident by stating specifically what failed to perform as expected, including the nature of the occurrence, i.e. an equipment, personnel or procedure failure. The second section addresses the direct cause of the incident by focusing on why the event occurred. This is done by looking into possible latent weaknesses of the failed component. The third section is directed toward the root cause by examining why the event was not prevented or, more specifically, why the contributing latent weaknesses were not eliminated in a timely manner.
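The three-part layout of the form can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names (`Finding`, `Section`, `ERCAF`, `cause_sections`) are assumptions made for this example and are not part of the ASSET methodology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    """One row of the form: a weakness or deficiency and its corrective action."""
    description: str
    corrective_action: str = ""


@dataclass
class Section:
    """A cause section of the form, organized around one guiding question."""
    question: str
    findings: List[Finding] = field(default_factory=list)


@dataclass
class ERCAF:
    """Hypothetical in-memory layout of an Event Root Cause Analysis Form."""
    occurrence: str          # first section: what failed to perform as expected
    nature: str              # "equipment", "personnel" or "procedure"
    direct_cause: Section    # second section: why did it happen?
    root_cause: Section      # third section: why was it not prevented?

    def cause_sections(self) -> List[Section]:
        """Return the cause sections in the order they appear on the form."""
        return [self.direct_cause, self.root_cause]


form = ERCAF(
    occurrence="Breaker failure relays failed to withstand excessive voltages",
    nature="equipment",
    direct_cause=Section("Why did it happen?"),
    root_cause=Section("Why was it not prevented?"),
)
form.direct_cause.findings.append(
    Finding("Zener diodes in the relays passed a spurious voltage signal")
)
```

Under this layout, a further section on work processes and organizational factors would simply be one more `Section` carrying its own guiding question.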

3. Work Processes

Since the NPP is organizationally best described as a machine bureaucracy, i.e. operated primarily by the standardization of work [Mintzberg, 1979], we focus our attention on the many work processes that operate, maintain and support the NPP. A work process is defined as a standardized sequence of tasks designed with the objective of achieving a specific goal within the operational environment of an organization. Most of the work processes at NPPs are described and controlled by written procedures. All procedures include an elaborate step-by-step set of instructions that are carefully documented to guide plant operators and maintenance crews through predictable job-related situations. The work processes in an NPP are designed to affect, either directly or indirectly, the performance of plant personnel and hardware [Davoudian, Wu and Apostolakis, 1994a]. The total number of work processes at an NPP may be very large; however, because we begin the work process evaluation with a specific incident in mind, we are usually limited to one or two. In this work, we will look at the corrective maintenance work process.

To evaluate the work process using WPAM, we look at the specific tasks that make up the corrective maintenance work process. Each task has defenses, or barriers, designed by the organization to prevent failure. The first task is prioritization. When a plant component has failed or is found to be in a degraded state, a work request is initiated. The request must be prioritized with respect to all other outstanding or incoming requests. For prioritization, the barriers include multiple reviews. Once the different corrective maintenance work requests have been prioritized, the next step involves planning and assembling the work package to carry out the evolution. The barriers for this task include Work Control Center (WCC), Engineering and departmental reviews. The third task involves scheduling and coordinating the planned corrective maintenance between the many departments playing a part in the evolution. The barriers include interdepartmental meetings and reviews. Once the maintenance evolution has been planned and coordinated, it is carried out as per the work order request. The barriers include self verification, quality control, and post maintenance testing. When the maintenance has been completed and tested, the equipment is returned to service. The barriers include self and independent verification. The last task in the corrective maintenance work process is always documentation.
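The task sequence and barriers described above can be written down as an ordered structure, which makes it easy to walk the process from the first task up to the task where an incident occurred. This is a minimal sketch: the names `Task`, `CORRECTIVE_MAINTENANCE` and `tasks_up_to` are hypothetical, and the empty barrier list for documentation only reflects that no barriers are named for it in the text.

```python
from typing import NamedTuple, Tuple


class Task(NamedTuple):
    name: str
    barriers: Tuple[str, ...]


# Tasks in process order, with the barriers named in the text.
CORRECTIVE_MAINTENANCE = (
    Task("prioritization", ("multiple reviews",)),
    Task("planning", ("Work Control Center review", "Engineering review",
                      "departmental review")),
    Task("scheduling/coordination", ("interdepartmental meetings",
                                     "interdepartmental reviews")),
    Task("execution", ("self verification", "quality control",
                       "post maintenance testing")),
    Task("return to service", ("self verification",
                               "independent verification")),
    Task("documentation", ()),  # no barriers are named for this task
)


def tasks_up_to(process, task_name):
    """Return the tasks from the start of the process through task_name."""
    names = [task.name for task in process]
    end = names.index(task_name)  # raises ValueError for an unknown task
    return process[: end + 1]


# Tasks to examine for an incident that occurred during execution.
for task in tasks_up_to(CORRECTIVE_MAINTENANCE, "execution"):
    print(task.name, "->", ", ".join(task.barriers))
```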

Considerable research has been done on how organizational factors affect the everyday operation of nuclear power plants. Using the Work Process Analysis Model (WPAM) [Davoudian, Wu and Apostolakis, 1994a and b], we begin to bridge the gap between the organization and NPP safety. The organizational factors are defined as the dimensions by which each task in the work process is affected by the organization. Taking the qualitative aspects of WPAM one step further, it can be shown that root cause analysis can be expanded in specific incidents where management deficiency and/or human performance factors are determined to be the underlying cause of an incident.

4. Loss of off-site power, Oconee, 1992

As a case study, we look at an incident resulting in a loss of off-site power. In 1989, a station report was initiated which requested replacement and upgrade of the existing batteries in the switchyard at Oconee Nuclear Station. In December 1990, the associated Nuclear Station Modification (NSM) was initiated. In May 1992, the utility submitted a request for a revision to Technical Specifications in order to extend a Limiting Condition for Operation (LCO) from 24 hours to 7 days. This would allow one battery or associated DC distribution system panel to be out of service long enough to replace the batteries in accordance with the NSM. As part of the modification package, two implementation procedures were developed, one for each battery. During the development of these two procedures, it was decided that the preferred configuration of the two DC buses would be to maintain separation of the buses, and to use the associated battery charger as the only source of power for each bus as its battery was replaced. During this decision making process, personnel in Engineering and Operations were consulted and concurred. After review, procedure TN/5/N2863/oo/AL2, "Replace 230KV SWYD Batteries SY-2", was approved on October 15, 1992 [LER 05000-270, 1992].

On October 19, 1992, while performing this maintenance, Oconee Unit 2 experienced a loss of off-site power, a generator load rejection, and a trip from 100% full power. A battery charger was placed in service without a connected battery. It produced excessive voltages which caused a series of spurious breaker failure relay actuations, locking out both buses in the 230 KV switchyard. Also, during recovery actions, shutdown of one emergency generator, after the emergency start signal had been reset, resulted in the unanticipated trip of the operating emergency generator, leading to a second loss of power on Oconee Unit 2. The root cause of the event was determined to be management deficiency due to a less than adequate corrective action program.

The Licensee Event Report (LER) concluded that three specific factors combined to produce the event. First, the breaker failure relay zener diodes would pass a spurious signal when subjected to greater than 200 VDC for two milliseconds or longer. Second, the 230 KV switchyard DC power system was being operated with the battery isolated from the bus, with the battery charger acting as the only source of voltage. Third, the battery charger, when operated in this configuration, produced an output voltage which varied from approximately 70 to over 200 VDC.
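The first factor is a threshold-and-duration condition: the relays pass a spurious signal when the voltage stays above 200 VDC for two milliseconds or longer. As a rough illustration of how such a condition could be checked against a sampled voltage trace, the sketch below uses a hypothetical function `exceeds_threshold` and made-up sample values. It assumes uniformly spaced samples, each representing one sample period, and it is not taken from the LER.

```python
def exceeds_threshold(samples_vdc, sample_period_ms,
                      limit_vdc=200.0, min_duration_ms=2.0):
    """Return True if the voltage stays above limit_vdc for min_duration_ms or longer.

    Each sample is assumed to represent sample_period_ms of elapsed time.
    """
    run_ms = 0.0
    for voltage in samples_vdc:
        if voltage > limit_vdc:
            run_ms += sample_period_ms
            if run_ms >= min_duration_ms:
                return True
        else:
            run_ms = 0.0  # the excursion ended before reaching the duration
    return False


# Made-up traces sampled every 0.5 ms (illustrative values only).
sustained = [125, 180, 210, 215, 205, 208, 150]   # 4 samples above 200 VDC = 2.0 ms
short_spike = [125, 210, 215, 130]                # 2 samples above 200 VDC = 1.0 ms

print(exceeds_threshold(sustained, 0.5))    # True
print(exceeds_threshold(short_spike, 0.5))  # False
```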


5. Conventional root cause analysis

The analysis done by the utility identifies the root cause of the incident to be management deficiency stemming from less than adequate corrective action, specifically poor planning and execution of maintenance. Using the ASSET format (Table 1) to address the root cause for the first occurrence, i.e. loss of off-site power, we see that the direct cause, as stated in the LER, was zener diodes in the breaker failure relays passing spurious voltage signals, causing the breakers to trip open. Contributors to this weakness included a power supply whose DC source voltage varied from approximately 70 to over 200 VDC.

The root cause section addresses the failure to eliminate this problem in a timely manner. By studying the LER, it is found that there was inadequate detection of possible problems concerning operation of the electric plant or battery charger in this configuration. The deficiency stemmed from a less-than-adequate corrective action to remove and replace the station batteries.

6. Including work processes and organizational factors

By modifying the ERCAF to address possible latent weaknesses in the organization, it is possible to link the organization directly or indirectly to the specific incident by including work processes and organizational factors in the analysis.

It is possible to identify the key work process(es) and associated organizational factors playing significant roles in this incident. Both preventive and corrective maintenance are included in the maintenance program at all NPPs. Preventive maintenance is usually scheduled periodically to ensure plant components meet technical specifications and/or surveillance requirements, while corrective maintenance refers to the repair and/or restoration of equipment or components which have failed or have been found, as a result of periodic testing, to be in a degraded state. The batteries were being replaced as part of a modification package, with a work package put into place to carry out this upgrade. By examining the standardized sequence of tasks designed within the operational environment of the NPP to achieve this modification, we see that the corrective maintenance work process (section 3) is the best choice. If the incident had taken place during testing, start-up, shutdown or other special evolutions, we would have to direct our search to work processes that contain those specific sequences of tasks.

Using the task flow chart developed by WPAM, we can now construct an organizational factors matrix. Its purpose is to show the organizational factors that may impact on the safe performance of each task. Using this matrix, we can then determine which organizational factors are involved in each task and its associated barrier. By taking this matrix one step further, it is possible to prioritize and rank each organizational factor within the associated task, leading to a weighted organizational factors matrix (Table 2). We can now easily identify the most significant organizational factors affecting each task in the corrective maintenance work process. For example, from the organizational factors matrix we see that the task of planning is affected by thirteen organizational factors. When we prioritize and rank these factors, we see that coordination of work, technical knowledge, time urgency, problem identification and organizational learning appear to be the most significant.
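Ranking the factors within one task amounts to sorting one column of the weighted matrix. The sketch below shows that step for the planning task. The function name `top_factors` is hypothetical, and the weights are placeholders invented for illustration (only six factors are listed), not the values of Table 2. They are chosen only so that the ranking agrees with the five factors named above.

```python
def top_factors(weights, k=5):
    """Return the k factor names with the largest weights, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Placeholder weights in percent for the planning task (NOT the Table 2 values).
planning_weights = {
    "technical knowledge": 17.0,
    "coordination of work": 12.0,
    "organizational learning": 11.0,
    "problem identification": 10.0,
    "time urgency": 9.0,
    "goal prioritization": 4.0,
}

print(top_factors(planning_weights))
```

Keeping only the top few factors per task is what lets the analysis concentrate on the more salient dimensions instead of examining every factor.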

We now have a way to see the direct and/or indirect impact of the organization on each task in the work process. Looking at every single organizational factor, regardless of its relative importance, would presumably require greater cost and effort. The weighted organizational factors matrix is suggested as a tool to focus directly on the more salient dimensions.

The root cause of the loss of power at Oconee was determined to be management deficiency. By expanding this analysis, we can suggest what the deficiencies were and how to improve the corrective maintenance work process in these areas.

The loss of off-site power occurred during the "execution" step of this work process, but we can learn even more by starting with the first task and working our way up to the incident, thereby seeing how certain factors may have been compounded from task to task.

Using the definitions of each organizational factor [Jacobs and Haber, 1994], we can direct the analysis by asking specific questions about the events surrounding each task. To demonstrate this approach, we begin with the first task, prioritization, and look for possible latent weaknesses by looking at the more significant organizational factors. For example, addressing goal prioritization, we could look into instances where plant personnel may not have understood, accepted or agreed with the purpose and relevance of plant goals. From the LER we learn that, in 1980, the vendor of the breaker failure relays had sent out "Product Reliability Letters" stating that these relays actuate spuriously if exposed to a 200 VDC differential for greater than 2 milliseconds. The letter also contained directions for a field change to correct the problem. Although utility personnel reviewing the letters recommended making the changes, the relays were never modified, suggesting that this action was judged not to be of "high priority". This may suggest possible weaknesses in the way the organization prioritizes when economic considerations are also weighed.

Looking specifically at the second task, planning, we see that the organizational factor technical knowledge is the most important. We also note that technical knowledge is one of the most important organizational factors for the tasks of prioritization, scheduling/coordination, execution and returning to normal line-up. Clearly, technical knowledge plays a very important role throughout the corrective maintenance work process.
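The observation that one factor ranks highly in several tasks can be made mechanical by counting, for each factor, the number of tasks in which it appears among the most significant. The sketch below is illustrative only. The function name `recurring_factors` is hypothetical, and the per-task lists are assumptions, except that technical knowledge is placed in all five tasks as stated above.

```python
from collections import Counter


def recurring_factors(top_by_task, min_tasks=2):
    """Return factors that are among the top factors of at least min_tasks tasks.

    The result is ordered by number of tasks (descending), then by name.
    """
    counts = Counter(
        factor for factors in top_by_task.values() for factor in set(factors)
    )
    recurring = [f for f, n in counts.items() if n >= min_tasks]
    return sorted(recurring, key=lambda f: (-counts[f], f))


# Illustrative top-factor lists per task (assumed, except technical knowledge).
top_by_task = {
    "prioritization": ["goal prioritization", "technical knowledge"],
    "planning": ["technical knowledge", "organizational learning",
                 "problem identification"],
    "scheduling/coordination": ["coordination of work", "technical knowledge"],
    "execution": ["technical knowledge", "time urgency"],
    "return to normal line-up": ["technical knowledge", "problem identification"],
}

print(recurring_factors(top_by_task))
```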

Technical knowledge refers to the depth and breadth of requisite understanding plant personnel have regarding plant design and systems, and of the phenomena and events that bear on plant safety. When we study this incident looking for specifics where a lack of technical knowledge could have contributed in some way in the planning stages, we learn from the LER that the vendor manual for the battery charger provided some specifications for current and voltage stability while connected to a battery, but no data for operation without a battery. There was no specific statement prohibiting operation without a battery, but setup instructions called for connecting a battery and all wording in the vendor documentation assumed that a battery was always connected. When the charger vendor was consulted, he stated that the chargers were not intended for use without a battery in the circuit.

Looking at another significant factor in the planning stage, organizational learning, we look for whether or not plant personnel and the organization used knowledge gained from past experiences to improve performance. Again from the LER, we find that a similar event had occurred at Vermont Yankee (VY) on April 23, 1991 (approximately 18 months prior to the Oconee incident). The VY event had also involved operation with one switchyard DC bus powered by a battery charger (isolated from its associated battery), inadequate voltage control by the charger partially due to failed components, and activation of breaker failure relays due to voltage surges associated with establishing the battery configuration. This event was evaluated as per the utility Operating Experience Program (OEP) and it was concluded that the equivalent portion of the circuit would not fail the same way. The OEP did not discover that a different circuit was subject to the same failure mode, with the same result: actuation of the relay.

Another important factor for this task is problem identification. We focus here on how the organization encouraged plant personnel to draw upon their knowledge, experience, and current information to identify possible problems in the work package. Many departments reviewed the work procedure, with none objecting to the switchyard line-up or the power supply configuration.

Similar analysis can be done on the remaining tasks, suggesting other possible latent weaknesses in the organization. For this paper, we only looked at the first two tasks.

By expanding the analysis to include work process evaluation, we have identified several organizational factors that possibly led to poor decisions by management. In particular, it is suggested that during the planning stages of this maintenance evolution, it was a lack of technical knowledge, a reluctance to use organizational learning and a failure to foresee possible problems in the procedure that led to the loss of off-site power. With this expanded analysis, we have pinpointed areas within the organization that, when improved, not only improve the operation of the plant but may also strengthen the overall safety culture of the organization by improving the work process.


By assessing its safety culture, an organization can determine where efforts need to be focused to improve the overall plant [Ostrom, Wilhelmsen and Kaplan, 1993]. The benefit of expanding root cause analysis to look additionally at work processes and organizational factors is that we assess and address safety culture when we look for latent weaknesses. Solutions to prevent future occurrences can now include improvements in the overall work process. As an example, from the LER, we learn that as a corrective measure, the utility revised the OEP to improve periodic assessments and effectiveness. From our expanded analysis, we can go beyond this improvement by fully appreciating the value of organizational learning in the planning of any standard maintenance operation. Organizational improvements and allocation of resources to improve organizational learning would improve not only the OEP but also the numerous other NPP work processes that utilize this specific organizational dimension.

REFERENCES

Davoudian, K., Wu, J.S., and Apostolakis, G., 1994a, "Incorporating Organizational Factors into Risk Assessment Through the Analysis of Work Processes," Reliability Engineering and System Safety, 45, 85-105.

Davoudian, K., Wu, J.S., and Apostolakis, G., 1994b, "The Work Process Analysis Model," Reliability Engineering and System Safety, 45, 107-125.

Jacobs, R., and Haber, S., 1994, "Organizational Processes and Nuclear Power Plant Safety," Reliability Engineering and System Safety, 45, 75-83.

LER 05000-270, 1992, Loss of Off-site Power and Unit Trip Due to Management Deficiency, Less Than Adequate Corrective Action Program.

Mintzberg, H., 1979, The Structuring of Organizations, Prentice-Hall Inc., Englewood Cliffs, New Jersey.

Ostrom, L., Wilhelmsen, C., and Kaplan, B., 1993, "Assessing Safety Culture," Nuclear Safety, 34, 163-171.

Reisch, F., 1994, "The IAEA-ASSET Approach to Avoiding Accidents is to Recognize the Precursors to Prevent Incidents," Nuclear Safety, 35, 25-35.

US Nuclear Regulatory Commission, 1993, Development of the NRC's Human Performance Investigation Process (HPIP), NUREG/CR-5455.


Event Title:
Occurrence title:

Occurrence: What failed to perform as expected?
  Breaker failure relays failed to withstand excessive voltages.
  Nature: Equipment failure

Direct Cause: Why did it happen?
  Latent weakness: Zener diodes in BF relays passed spurious voltage signal, causing breakers to trip open.
    Corrective action: Breaker relays modified per vendor instructions.
  Contributor to existence of the latent weakness: DC power system was being operated with the battery isolated from the bus, with the battery charger acting as the only source of voltage.
    Corrective action: Modification procedure revised to maintain busses tied together.

Root Cause: Why was it not prevented?
  Deficiency to timely eliminate the latent weakness: Inadequate detection of possible problems when operating battery charger without battery in circuit.
    Corrective action: Other Oconee procedures were revised and precautions added where appropriate.
  Contributor to the existence of the deficiency: Management deficiency stemming from less than adequate corrective action to perform required maintenance.
    Corrective action: OEP reviewed for enhancements to improve both [illegible] and periodic assessments of program effectiveness.

Expanded Root Cause: Which work process(es) and organizational factors played significant roles in the incident?
  Latent weaknesses in the organization leading to the incident: Various deficiencies in corrective maintenance work process.
    Corrective action: Assess key organizational factors within each task.
  Contributors to the existence of the deficiency: 1) lack of technical knowledge and organizational learning within the task of planning; 2) lack of problem identification in various Department reviews prior to issuing work order.
    Corrective action: Expand OEP to include improvements in organizational learning. Implement schemes to assess and upgrade plant tech. knowledge, e.g. use of behavioral checklists.

Table 1. Expanded root cause analysis form

Table 2. Weighted Organizational Factors Matrix for the Corrective Maintenance Work Process (in %) [table body not recoverable]