Root Cause Failure Analysis: Fact or Fiction?  

by Namik Kosaric P.Eng. and Killian Wagner, Norcan Reliability Engineering

Norcan Reliability Engineering Proprietary – Do Not Copy

Failure Occurrence

Equipment failures are a common occurrence in the process industry. The hydrocarbon processing business has hazards inherent in its processing units and equipment. Some of these can have far-reaching effects, including multiple fatalities, explosions, ecological disasters, spills, and fires. Incidents and accidents need to be investigated to find and rectify deficiencies in the manufacturing system. Investigation of an accident is a window through which to view the existing manufacturing system: it reveals these deficiencies and yields benefits that go far beyond correction of the immediate causes of the accident.

Investigation of accidents is among the most traumatic experiences at any processing facility. Accidents and incidents are difficult to analyze because of the complex interaction of technology and people that causes them. Competencies, management systems and practices were instituted and controlled by corporate and local management, and suddenly a requirement is put forward to analyze and evaluate the effectiveness of the current management system. Incidents are investigated by a team composed of personnel intimately knowledgeable about the work situation under investigation, strengthened by internal and external subject matter experts.

It is commonly believed that incidents have a “Root Cause” associated with the fundamentals of how the organization carries out its activities relative to the design, operation and maintenance of hydrocarbon production and processing. The primary aim of incident investigation is to identify the root cause of a problem in order to create effective corrective actions that will prevent that problem from recurring in other operations and circumstances.

Process of Incident Investigation

The focus and process of an incident investigation can influence its results and outcome.

An investigation team will normally uncover what it is looking for. If the intention of the investigation is to devise a quick solution based on brainstorming of ideas, then the team will favor Root Cause Analysis methodologies that easily document the results of brainstorming, as illustrated in Fig. 1. However, if a team believes that the work system, training or competence was the root of the current incident, then it will focus on the management system. That choice is largely influenced by the investigating team's mission, regulatory authorities and company management.

Figure 1: Brainstorming or facts?

A team investigating an incident can pursue inductive or deductive reasoning to arrive at a “Root Cause”.

The investigation process that facilitates fast team and management consensus is the deductive reasoning methodology.

Deductive incident investigation, Fig. 2, starts with a general rule, a premise, which the team knows to be true or accepts as true for the circumstances. From that rule, the team draws a conclusion about something specific: it takes a general premise and deduces particular conclusions about the incident. The team's conclusions are only as good as its assumptions. Put another way, the team's presuppositions will always determine its conclusions. Unless the evidence or observations are exhaustive and complete, the conclusion is only a guess.

Inductive incident investigation, Fig. 2, is the process of arriving at a conclusion based on analysis of data. The team gathers particular data in the form of assumptions, then reasons from these particulars to a general conclusion.

The certainty of the team's conclusion depends entirely on the team's correct interpretation of the evidence and on the consistency of that evidence with the remaining incident facts that were not, are not, or may never be observed.

The strength of an inductive investigation increases as it approaches completeness. If the team's evidence represents all possibilities within the incident, the team's inductive conclusion will be correct. The more the team can demonstrate that the evidence is truly representative, the more compelling its conclusion will be.

Figure 2: RCFA, Facts or Fiction. (Chart labels: “Investigation Strategies - Focus”; “Time Line – Investigation Schedule”; “Inductive Incident Investigation”; “Deductive Incident Investigation”; “Culture Shift Toward Data Based Investigation”. Ref. CCPS: Investigation Chemical Process Incidents)

Management System Deficiency

The challenge posed to the team is to define current deficiencies in the management system as compared with what it should be, i.e. the benchmark performance. That implies that the investigation team must be privy to information on how benchmark plants are operated by the best in the industry. Otherwise the point of reference is lost.

Standards Governing Hydrocarbon Processing Plant Operation

Operation of a hydrocarbon processing plant is governed by a set of standards, procedures, practices and methods. These standards, defined as the management system, are the backbone of safety and effectiveness. The management system defines the rules of conduct for plant personnel and is company and site specific. Some of the standards contained in the management system are voluntary, developed by industry and corporations as well as by international and local industry associations.

PSM (Process Safety Management) as Standard of Operation Integrity

OSHA's mandatory Process Safety Management Standard (29 CFR 1910.119) [8], outlined in Fig. 3, came into effect in 1992 in the USA to provide the management system framework for how hydrocarbon plants are to be safely designed, operated and maintained.

In addition to the USA, many other countries and companies (ARAMCO, SABIC, BAPCO, etc.) have accepted OSHA's PSM as a standard, a model or a basis for voluntary compliance. Full comprehension of the standard, as well as of the challenges in meeting it, is an essential ingredient for achieving safe and effective hydrocarbon processing plant operation anywhere in the world.

PROCESS SAFETY MANAGEMENT ELEMENTS

No.   Element Title
1     Process Safety Information
2     Employee Involvement
3     Process Hazard Analysis
4     Operating Procedures
5     Training
6     Contractors
7     Pre-startup Safety Review
8     Mechanical Integrity
9     Hot Work
10    Management of Change
11    Incident Investigation
12    Emergency Planning and Response
13    Compliance Audits
14    Trade Secrets

Figure 3: Outline of PSM Standard

The intention of PSM [8] is to establish a comprehensive management program that integrates technologies, procedures and management practices in a holistic approach aimed at preventing or minimizing the consequences of catastrophic releases of toxic, reactive, flammable, or explosive chemicals. This requires proactive identification, analysis and control of potential hazards. The emphasis is on managing the risk, because the risk can never be completely removed by design, operation and maintenance practices.

Definition of the Root Cause

The RCFA Inductive Process investigates the root causes of an incident by considering incomplete and ineffective elements of the PSM system, as described in Fig. 4.

Root cause: a fundamental management system (Process Safety Management system) reason that the event occurred and that management has the capability to correct.

Figure 4: Root Cause

The root cause is a failure of the management system to control hazards as stipulated by PSM. This includes employee involvement and teamwork in the proactive application of engineering disciplines, techniques, skills and data to achieve the agreed plant operating pattern and product quality, within the accepted plant conditions and safety standards, at optimum resource costs and acceptable risks.

Failure Causation Model

Figure 5: Failure Causation Model

The stages in the accident causation model are:

The Active Failures, illustrated in Fig. 5, appear shortly before the incident. They are noticeable and consist of breached system defenses due to lapses in skill, rule or knowledge and/or violations. The active failures are caused by production personnel, maintenance personnel, etc.

The Supervision & Condition Issues / Deviations include fallible decisions regarding Tools and Equipment, Work Standards, Operating outside Limits, Work Culture, Competence, Skill, Stress, Motivation and Orientation.

PSM Elements Incompleteness and Ineffectiveness may be present within the system for a long time. They include poor implementation of Employee Participation, Process Safety Information, Process Hazard Analysis, Operating Procedures, Training, Subcontractor Safety, Pre-Startup Safety Review, Mechanical Integrity, Non-routine Work Authorizations (Hot Work Permits), Management of Change, Incident Investigation and Emergency Planning and Response.

The Audit of the Implemented PSM Elements is essential to ensure completeness and effectiveness. The audit should define gaps and ensure closure of discovered issues in the same elements: Employee Participation, Process Safety Information, Process Hazard Analysis, Operating Procedures, Training, Subcontractor Safety, Pre-Startup Safety Review, Mechanical Integrity, Non-routine Work Authorizations (Hot Work Permits), Management of Change, Incident Investigation and Emergency Planning and Response.
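The audit of PSM elements described above can be pictured as a simple gap register. The following is a minimal Python sketch, not the authors' tool: the `ElementAudit` record, its three flags and the sample findings are hypothetical and introduced only for illustration; only the element names come from the text.

```python
from dataclasses import dataclass

# PSM elements as listed in the text above.
PSM_ELEMENTS = [
    "Employee Participation",
    "Process Safety Information",
    "Process Hazard Analysis",
    "Operating Procedures",
    "Training",
    "Subcontractor Safety",
    "Pre-Startup Safety Review",
    "Mechanical Integrity",
    "Non-routine Work Authorizations (Hot Work Permits)",
    "Management of Change",
    "Incident Investigation",
    "Emergency Planning and Response",
]


@dataclass
class ElementAudit:
    """Hypothetical audit record for one PSM element."""
    element: str
    complete: bool        # assumption: the element exists and covers the operation
    effective: bool       # assumption: the element is actually applied in the field
    closed: bool = False  # assumption: a discovered gap has been closed out


def open_gaps(audits):
    """Return elements that are incomplete or ineffective and not yet closed."""
    return [a.element for a in audits
            if not (a.complete and a.effective) and not a.closed]


# Invented findings, for illustration only.
audits = [ElementAudit(name, True, True) for name in PSM_ELEMENTS]
audits[4] = ElementAudit("Training", complete=True, effective=False)
audits[9] = ElementAudit("Management of Change", complete=False,
                         effective=False, closed=True)

print(open_gaps(audits))
```

In this sketch a gap stays on the list until it is both identified and closed, which mirrors the requirement that the audit should define gaps and ensure their closure.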

Sequence of Events, Defining Deviation

The Sequence of Events provides a structure for investigators to organize and analyze the information gathered during the investigation and to identify gaps and deficiencies as the investigation progresses.

The Sequence of Events chart, with its logic sequence, describes the events leading up to an occurrence, plus “the conditions less than expected” surrounding these events.

The team needs to look far enough back, Fig. 6, to uncover defects put into the system and equipment through errors in design, fabrication, operation or maintenance. In many instances equipment is designed with flaws that make it liable to explode under some circumstances. Those conditions and defects, inherited from deficiencies in design, operation and maintenance, need to be identified and analyzed as relevant to the current investigation.
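One way to picture a Sequence of Events chart is as a time-ordered list of steps, each comparing what was planned with what actually happened. The Python sketch below is illustrative only: the `Event` structure, its field names and the sample data are assumptions, not part of the RCFA Inductive Process itself.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    """One step in a Sequence of Events chart (hypothetical structure)."""
    when: datetime
    life_stage: str   # e.g. "Design", "Maintenance", "Operation"
    planned: str      # what was expected to happen
    actual: str       # what actually happened
    conditions: list = field(default_factory=list)  # "conditions less than expected"

    @property
    def is_deviation(self) -> bool:
        # Assumption: a step deviates when actual differs from plan
        # or when a degraded condition has been recorded against it.
        return self.planned != self.actual or bool(self.conditions)


def build_chart(events):
    """Order events chronologically, as on a time line."""
    return sorted(events, key=lambda e: e.when)


def deviations(chart):
    """Return the steps that need further analysis."""
    return [e for e in chart if e.is_deviation]


# Invented example data, for illustration only.
events = [
    Event(datetime(2007, 3, 2, 14, 5), "Operation",
          "Pump started with suction valve open",
          "Pump started with suction valve closed"),
    Event(datetime(1998, 6, 1), "Design",
          "Relief valve sized for blocked outlet",
          "Relief valve sized for blocked outlet",
          conditions=["Sizing basis not documented"]),
    Event(datetime(2007, 3, 2, 13, 50), "Operation",
          "Permit issued", "Permit issued"),
]

chart = build_chart(events)
for e in deviations(chart):
    print(e.when.date(), e.life_stage, "-", e.actual, e.conditions)
```

Sorting by time puts the old design-stage condition at the head of the chart, which is the point of looking far enough back: a latent defect appears alongside the operating error that triggered the incident.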

Figure 6: Time Line for Sequence of Events. (Diagram titled “Deviation Considering all Life Stages of the Plant”. Labels: Plan, Actual, Date/Time, Flow, Deviation. Life stages shown: Design, Engineering, Construction, Start-up, Operation, Maintenance, Alteration. Annotations: “Unstable Operation due to Induced Errors”; “Scope to be Covered by Sequence of Events”.)

The RCFA Inductive Process uses the Failure Causation Model, illustrated in Fig. 7, to analyze the deviation (“Performance Less Than Expected”) leading to accidents and as a framework for accident/incident investigation.

Critical elements in the Sequence of Events are analyzed against each element of the Failure Causation Model.

The RCFA Inductive Process encourages the study of degraded management system elements that may have lain dormant for a long time (since design, fabrication, maintenance and operation) until they finally contributed to the accident.

Figure 7: Sequence of Events incorporating Failure Causation Model

Risk Due to Degraded Management System

Degradation of PSM is frequently observed in industrial operations today. Degradation can completely change the risk profile of a plant. If an effective management system is not in place, as reflected in degraded PSM elements, some low-probability scenarios could become real hazards, as reflected in increased risk levels. The deficient PSM elements need to be identified and managed.

Figure 8: Risk Increase due to Degraded PSM System

Some possible reasons for degradation would be:
• New product and equipment requiring training
• No field organization to implement program
• No follow up on training
• Not keeping up to date with current technology
• Producing more efficiently to remain competitive

Whenever additions are made (new plant and equipment, new personnel), large or small, permanent or temporary, managers and staff should assess the possible impact of the change on the PSM system in place.

Effectiveness Evaluation of PSM System Elements

Critical PSM elements defined by the Sequence of Events need to be assessed for effectiveness, considering questions such as: Are the elements applied in the field? Were they developed for the current operation? Are they general in nature or specific to the operation at hand?

Driven by required PSM compliance, the oil and gas industry has frantically produced a large volume of documentation that may not have reached the shop floor for application.

Figure 9: Criteria for Assessing Effectiveness of PSM Elements

Risk Evaluation

A key deliverable of incident investigation is evaluation of the risk, Fig. 10, due to a degraded PSM system. When operating with an incomplete or ineffective PSM management system, the reactive effort may become a threat to progress and survival.

The RCFA Inductive Process provides a risk metric, Fig. 10, that identifies and analyzes PSM risks by assessing the amount of variance between the completeness and effectiveness of the analyzed steps in the Sequence of Events.
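A risk table built on completeness and effectiveness can be pictured as a two-way lookup. The Python sketch below is purely illustrative: the three rating levels and every cell of `RISK_TABLE` are assumptions made for this example and do not reproduce the actual table in Fig. 10.

```python
# Hypothetical three-level ratings; the actual risk table in Fig. 10
# is not reproduced here.
LEVELS = ("low", "medium", "high")

# Assumed lookup keyed by (completeness, effectiveness).
# Risk falls as both completeness and effectiveness rise.
RISK_TABLE = {
    ("low", "low"): "high",
    ("low", "medium"): "high",
    ("low", "high"): "medium",
    ("medium", "low"): "high",
    ("medium", "medium"): "medium",
    ("medium", "high"): "low",
    ("high", "low"): "medium",
    ("high", "medium"): "low",
    ("high", "high"): "low",
}


def psm_risk(completeness: str, effectiveness: str) -> str:
    """Look up the assumed risk level for one analyzed step."""
    if completeness not in LEVELS or effectiveness not in LEVELS:
        raise ValueError("ratings must be one of: " + ", ".join(LEVELS))
    return RISK_TABLE[(completeness, effectiveness)]


# Invented ratings for three steps in a Sequence of Events.
steps = [
    ("Operating Procedures", "high", "low"),
    ("Training", "medium", "medium"),
    ("Mechanical Integrity", "low", "low"),
]
for element, completeness, effectiveness in steps:
    print(element, "->", psm_risk(completeness, effectiveness))
```

A procedure that is fully documented but not applied in the field (high completeness, low effectiveness) still carries risk in this sketch, which is the situation the text describes when documentation never reaches the shop floor.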

The success of PSM risk reduction efforts associated with this approach will depend on the ability and willingness of the investigation team to make a concerted effort to replace any deficient PSM practices and procedures with industry best practices. In practical terms, a risk assessment is a thorough look at PSM completeness and effectiveness to identify those activities, equipment, situations, processes, etc. that may cause harm, particularly to people.

One of the primary benefits of the RCFA Inductive Process approach is that it addresses pervasive and subtle sources of risk in PSM programs and uses fundamental principles and proven procedures to reduce risks. The RCFA Inductive Process methodology requires that the risk associated with incomplete and ineffective PSM be evaluated as part of incident investigation. This involves proactive and systematic identification, evaluation, and mitigation or prevention of the risk, in current operation, of chemical releases that could occur as a result of failures in process, procedures, or equipment due to a deteriorated PSM system.

The RCFA Inductive Process risk assessment process is:

• Identify hazards associated with noncompliance with a PSM element for each step of the Sequence of Events.
• Analyze or evaluate the risk associated with that hazard.
• Determine appropriate ways to eliminate or control the hazard.

After identification is made, the RCFA Inductive Process guides the investigator to evaluate how likely and severe the risk is, and then to decide what measures should be in place to effectively prevent or control the harm.
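The identify, analyze and control steps above can be sketched as a small hazard register. The Python example below is an illustration under stated assumptions: the 1 to 5 scales, the likelihood times severity product and the rating thresholds are a common convention chosen for this sketch, not values taken from the RCFA Inductive Process, and the sample hazards are invented.

```python
from dataclasses import dataclass


@dataclass
class Hazard:
    step: str          # step in the Sequence of Events
    psm_element: str   # PSM element that was not complied with
    likelihood: int    # assumed scale: 1 (rare) to 5 (frequent)
    severity: int      # assumed scale: 1 (minor) to 5 (catastrophic)
    control: str = ""  # proposed way to eliminate or control the hazard

    def score(self) -> int:
        # Likelihood x severity product (an assumed scoring convention).
        return self.likelihood * self.severity

    def rating(self) -> str:
        # Thresholds are assumptions for this sketch.
        s = self.score()
        if s >= 15:
            return "high"
        if s >= 6:
            return "medium"
        return "low"


def prioritize(hazards):
    """Rank hazards so the highest-scoring ones are addressed first."""
    return sorted(hazards, key=lambda h: h.score(), reverse=True)


# Invented hazards, for illustration only.
hazards = [
    Hazard("Out-of-date drawing used for isolation", "Process Safety Information",
           likelihood=2, severity=2, control="Update drawings"),
    Hazard("Valve line-up before start-up", "Operating Procedures",
           likelihood=4, severity=4, control="Revise and field-verify the procedure"),
    Hazard("Contractor hot work near open drain", "Hot Work",
           likelihood=2, severity=5, control="Enforce permit and gas testing"),
]

for h in prioritize(hazards):
    print(h.rating(), h.score(), h.step, "->", h.control)
```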

Since deviation from PSM is a common cause of accidents, by managing the PSM system we are managing potential incidents.

Figure 10: Risk Table considering Completeness and Effectiveness of PSM Elements

The PSM program should be evaluated by comparing actual PSM performance with the stated PSM objectives and requirements. The program should also be compared with a baseline of those industry-wide processes and practices that are critical to the PSM program. The variances from these two baselines indicate the PSM risk present in the program. These results should be documented in a standard format to facilitate the development of a risk handling/mitigation and risk tracking plan. Assessments should be done by a competent team of individuals who have a good working knowledge of the workplace. The team should always include supervisors and workers who work with the process under review, as they are the most familiar with the operation.

RCFA Inductive Process assessment of incidents is used to answer the following questions:
• What has gone wrong and what could go wrong?
• What PSM elements are affected?
• What is the potential consequence for the completeness and effectiveness of PSM?
• How could it affect me or others?
• How likely is it to happen?
• What can I do about it?

Risk Control

Processing plants operating in reactive mode may lack an understanding of what needs to be done to achieve proactive, knowledge-based operation, maintenance and redesign, for the benefit of OSHA PSM compliance as well as business improvement.

The results of the investigation should give direction for transformation to a knowledge-based management system. The risk associated with a deficient management system needs to be evaluated and brought to an acceptable level.

Implementing Improvements

(Flow chart content, Ref. FMRIMS@Root ™:)
• What can go wrong? (PHA)
• What risk control systems (PSM) are in place to control risks?
• What will the PSM system deliver? What does a good PSM system look like?
• What are the most important parts of PSM effectiveness? Who are the personnel and management responsible for controlling effectiveness?
• PSM system completeness: set a lagging indicator to show whether or not PSM system completeness is being achieved.
• PSM system effectiveness: set leading indicators against key parts to show that controls are working as intended.
• Dual assurance that risks are being effectively managed.
• Follow up on RCFA findings to rectify faults in the Process Safety Management system.
• Regularly review performance against all indicators to check the completeness and effectiveness of the PSM system and the suitability of the indicators.

Figure 11: Implementing Completeness and Effectiveness Actions
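The pairing of leading and lagging indicators named in Fig. 11 can be sketched as a simple check. This Python example is a hedged illustration: the `Indicator` structure, the `dual_assurance` rule (at least one indicator of each kind, all on target) and the sample values are assumptions, not a definition taken from the figure.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    kind: str        # "leading" (controls working) or "lagging" (outcome achieved)
    target: float
    actual: float
    higher_is_better: bool = True

    def on_target(self) -> bool:
        if self.higher_is_better:
            return self.actual >= self.target
        return self.actual <= self.target


def dual_assurance(indicators) -> bool:
    """Assumed rule: both kinds of indicator exist and all are on target."""
    kinds = {i.kind for i in indicators}
    return {"leading", "lagging"} <= kinds and all(i.on_target() for i in indicators)


# Invented indicators, for illustration only.
indicators = [
    Indicator("PHA actions closed on time (%)", "leading", target=95, actual=97),
    Indicator("Loss-of-containment events per year", "lagging",
              target=0, actual=1, higher_is_better=False),
]

for i in indicators:
    print(i.kind, i.name, "on target" if i.on_target() else "off target")
print("dual assurance:", dual_assurance(indicators))
```

Here the leading indicator looks healthy while the lagging indicator does not, so assurance is withheld; a mismatch of this kind is also a prompt to review whether the indicators themselves are suitable.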

Some organizations have implemented improvements on a smaller scale and reported success by doing the following: start with a small process that can be completed in a short time frame; set clear timelines; do not spread resources thinly; and focus on the short-term payoff. Management and primary stakeholders must be involved, or else even a limited implementation will fail.

Risk control can incorporate the following elements:

PSM Owner Actions: The PSM owner is the person responsible for designing the processes necessary to achieve the objectives of the business plans that can lead to complete and effective PSM. The process owner is responsible for the creation, update and approval of documents (procedures, work instructions/protocols) that support the process. Many PSM process owners are supported by a process improvement team, which the process owner uses as a mechanism to help create a high-performance process. The PSM process owner is the only person who has authority to make changes in the process, and manages the entire process improvement cycle to ensure performance effectiveness. This person is the contact for all information related to the process.

PSM Improvement Plan: The PSM process owners create and own the process performance objectives of the organization. The PSM process owner first needs to understand the external and internal customer requirements for the process, using the business plans as a source to help understand the long-term and short-term customer and business requirements. This person then translates these requirements into process performance objectives and establishes product (including service) specifications. This person establishes process performance metrics to measure the PSM process's capability to meet the product health and safety, economic and environmental objectives. This set of metrics is called key performance indicators (KPIs). The PSM process owner then designs process steps describing work that, when performed, will be capable of producing product that meets the customer and business requirements.

Implement Operator Knowledge Base: The repository and records of past incident investigations form a corporate knowledge base. This knowledge base needs to be managed and frequently revisited to ensure corporate compliance with the implemented management systems (PSM). Corporations seldom have such a database, as shown in Fig. 12, located on the corporate intranet platform to guide current investigations and provide records of past investigations.

 

Figure 12: Web Based RCFA Management

 

List of References:

[1] FMRIMS@Root ™, http://www.norcanreliabilityengineering.com

[2] Swain, A.D. & Guttmann, H.E. (1983). Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. NUREG/CR-1278, USNRC.

[3] Humphreys, P. (1995). Human Reliability Assessor's Guide. Human Factors in Reliability Group.

[4] "The Human Factors Analysis and Classification System (HFACS)," Approach, July - August 2004. Accessed July 12, 2007.

[5] Reason, J. (1990). Human Error. Cambridge University Press.

[6] Standards Association of Australia (1999). Risk management. North Sydney, N.S.W: Standards Association of Australia. ISBN 0-7337-2647-X.

[7] United States Environmental Protection Agency (April 2004). General Risk Management Program Guidance. United States Environmental Protection Agency. http://www.epa.gov

[8] OSHA PSM 1910.119; Process safety management of highly hazardous chemicals. http://www.osha.gov

Namik Kosaric is a Canadian Professional Engineer with experience at Norcan Reliability Engineering (current), PETRONAS, Bahrain Petroleum Company and ESSO Petroleum Canada in reliability improvements and maintenance cost reduction, mechanical design, project engineering and technical support of oil refineries and oil production facilities.

For the last 8 years at PETRONAS, Namik Kosaric was responsible for providing technical and knowledge leadership in the development, coordination and implementation of plant reliability and integrity improvements and programs for PETRONAS OPUs, to improve and support the overall Petroliam Nasional Berhad objectives.

At BAPCO, Namik Kosaric pioneered and implemented root cause failure analysis of lost profit opportunities and chronic failures, using multi-disciplinary teams to improve plant reliability, availability and safety and ultimately to reduce operating costs. Significant cost savings were achieved as a result of over 200 completed investigations.

For 23 years at ESSO Petroleum Canada, Namik Kosaric made significant contributions worldwide in reliability improvements, design, projects and maintenance cost reduction in upstream and downstream facilities.

Root Cause Failure Analysis: Fact or Fiction? Part 2 

by Namik Kosaric P.Eng. and Killian Wagner, Norcan Reliability Engineering

Challenge of Incident Investigation

The RCFA Inductive Process, which considers the risk posed by degraded PSM elements, is a systematic incident investigation method for analyzing the integrity of the current PSM system, defined as the policies and procedures that ensure changes do not result in operation outside established safety parameters. It needs to become an essential component of a plant's process safety system, because degradation and non-compliance occur daily in a process plant.

Hydrocarbon Processing Plant Complexity

The challenge of incident investigation lies in understanding how work and decision making are carried out within the current organization, how the current PSM management system operates, and how effective that system is. The investigation team needs to assess:

Operation Changes: The operating plant environment changes daily. Although highly automated, hydrocarbon processing plants are subject to frequent changes in operating parameters as well as equipment failures. Thousands of alarms can go off and be left unattended in a single day of operation. In addition, hydrocarbon processing plants are becoming more complex and more tightly coupled, both between process units and between operating plants, while the interconnections between plants and functions are frequently not automated.

Equipment Modification: Hydrocarbon processing plants are exposed to a constant flux of changes implemented by the personnel involved in redesign, operation and maintenance. A large hydrocarbon processing complex can issue many thousands of maintenance work orders per year and face a large number of abnormal operating scenarios. Employees are frequently overwhelmed by the magnitude of changes and issues.

Technological Advances: Process changes, improvements and technological advances increase competence requirements. As equipment complexity increases, understanding the implications of maintenance and operating changes for the equipment can become a critical element of safe operation. A characteristic feature of complex hydrocarbon processing plants is that their complexity often exceeds employees' grasp of how the facilities are designed and how they should be operated and maintained.

Redundancy: Current high crude oil prices have put pressure on the production and processing capability of existing facilities. To meet availability targets, equipment and system redundancy is implemented. A high level of redundancy is easily justified as a means of increasing reliability, and possibly safety. However, reliance on redundancy may lead to decreased emphasis on PM, monitoring, operating envelopes and other safety engineering techniques. In practice, redundancy may cover up, or mute, operating, maintenance and design errors and prevent them from becoming visible until something catastrophic occurs.

The investigating team needs to identify the causes, in particular the ineffective management system causes, that drive human performance, by comparing the plant against a PSM system benchmark, i.e. plants with pacesetting HSE and economic performance.

Example: Electrical Motor Failure

Application of the RCFA Inductive Process, considering the risk of degraded PSM elements, is illustrated by a failure analysis of the Starhill Platform Crude Transfer Pump. The example is illustrative and bears no resemblance to any actual installation, facility or company. The pump was installed in 1988 and operated with the expected reliability until 2000. Six pump and motor failures have occurred in the last four years; four of them required replacement with a factory-new motor at a cost of US$ 150,000 each. The failures are characterized by failed bearings, a bent rotor shaft, and damage to the rotor and stator from heavy rubbing contact.
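As a rough worked illustration of the figures quoted for the Starhill example, the sketch below computes the direct motor replacement cost, an approximate failure rate and an approximate mean time between failures (MTBF) for the high-failure period. It assumes continuous service over the four years and ignores repair downtime and every cost other than the new motors. The variable names are illustrative and are not part of the RCFA Inductive Process.

```python
# Figures quoted in the Starhill example.
failures = 6              # pump and motor failures in the period
period_years = 4          # length of the high-failure period
motor_replacements = 4    # failures that required a factory-new motor
motor_cost_usd = 150_000  # cost of each replacement motor

# Direct replacement cost only; lost production and labour are not included.
replacement_cost_usd = motor_replacements * motor_cost_usd

# Assumption: continuous service, so calendar time approximates operating time.
failure_rate_per_year = failures / period_years
mtbf_months = period_years * 12 / failures

print(f"Motor replacement cost: US$ {replacement_cost_usd:,}")
print(f"Failure rate: {failure_rate_per_year:.1f} failures per year")
print(f"Approximate MTBF: {mtbf_months:.0f} months")
```

Under these assumptions the direct motor cost alone is US$ 600,000, with a failure roughly every eight months.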

Figure 1: Example Investigation: Pump and Electric Motor Failure

Sequence of Events: Electrical Motor Failure

Fig. 2, The Pump and Motor Failures at Starhill Platform, describes the Sequence of Events.

For each step, the Sequence of Events for the pump failures at Starhill Platform records the timing, who was involved, the details of the step, a definition of the performance less than expected for that step, the agent involved in the change (a PSM procedure or inadequate facilities), and the consequence of the performance less than expected. Two distinct periods can be observed in the sequence of events: expected operation from 1988 until 2000, and high failure rates after 2000.
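The fields recorded for each step of a Sequence of Events could be captured in a simple record, as in the minimal sketch below. The class, field and function names are hypothetical, and the two sample entries (including the years, the "who" values and the consequence text) are placeholders loosely based on the narrative, not data from the actual Fig. 2.

```python
from dataclasses import dataclass


@dataclass
class SequenceStep:
    """One row of a Sequence of Events table (field names are illustrative)."""
    step_no: int
    year: int              # timing of the step
    who: str               # who was involved
    details: str           # details of the step
    performance_gap: str   # performance less than expected ("" if none)
    agent: str             # PSM procedure or inadequate facilities
    consequence: str       # consequence of the performance gap


def split_by_period(steps, boundary_year=2000):
    """Split steps into the expected-operation and high-failure periods."""
    expected = [s for s in steps if s.year <= boundary_year]
    high_failure = [s for s in steps if s.year > boundary_year]
    return expected, high_failure


# Hypothetical entries for illustration only.
steps = [
    SequenceStep(1, 1988, "Project team", "Pump installed", "", "", ""),
    SequenceStep(2, 2001, "Operations", "Period of frequent failures",
                 "Repeated pump and motor failures",
                 "Change in operating condition",
                 "Motor replacement"),
]

expected, high_failure = split_by_period(steps)
print([s.step_no for s in expected], [s.step_no for s in high_failure])
```

Keeping each step as a structured record makes it straightforward to isolate the period of interest before assessing individual steps.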

Figure 2: Sample Sequence of Events

The Sequence of Events, Fig. 2, identifies the change ("performance less than expected") and the procedures, standards and practices used to control each step. The focus of the investigation is the period of high failure rates. Each step in the sequence is evaluated, as shown in Fig. 3 below, for Active, Precondition, Supervision and PSM System failures.

Analyze Sequence Step for Active, Contributory and PSM deviation

Figure 3: Assessment Model

Analysis of the period after 2000 should go deep enough to uncover issues at Starhill platform relating to active, precondition, supervision and degraded PSM system causes. The issues that need to surface could have been created by inadequate design, fabrication or construction, as well as by poor operation or maintenance. Everything happens for a reason, and those reasons need to be fully understood to prevent recurrence.

Deviation Statement: Electrical Motor Failure

Each event in the Sequence of Events needs to be defined in terms of: Details of Event, Change (Performance Less than Expected), Agent of Change and Effect of Change. The Change in Operating Condition involving Water, Sand and Slugs Carry Over was a significant factor that warrants further investigation of this event.

Figure 4: Assessment of Step in the Sequence of Events

Scope of Investigation: Electrical Motor Failure

The Change in Operating Condition involving Water, Sand and Slugs Carry Over is investigated further with respect to Active and Precondition causes and the degradation of PSM elements.

The investigation applies qualitative techniques that define assessment dimensions, each rated for completeness and effectiveness, with PSM system degradation taken into account. The RCFA Inductive Process is supported by web-based software with an extensive questionnaire for assessing the completeness and effectiveness of each PSM dimension.

Fig. 5 outlines the RCFA Inductive Process management system investigation menu:

Figure 5: Scope (Menu) for the Event Assessment

Once a sentinel event has been identified for analysis, a multidisciplinary team is assembled to direct the investigation. The members of this team should be trained in the techniques and goals of the RCFA Inductive Process, as the tendency to revert to personal biases is strong. Multiple investigators and management interference can allow triangulation or corroboration of major findings and increase the validity of the final results.

At the conclusion of the RCFA Inductive Process, the team summarizes the management system causes and their relative contributions, and begins to identify administrative and systems problems that might be candidates for redesign.

PSM Element Completeness: Pump & Electrical Motor Failure

The RCFA Inductive Process uses specific criteria, Figs. 6 and 7, to assess the completeness and effectiveness of each PSM element. These criteria require a ranking from 1 to 4. Completeness, Fig. 6, is ranked as follows:

PROCEDURE ASSESSMENT

COMPLETENESS

RATING 4: PSM System is mature, having undergone feedback, review, and improvement cycle

Revisions to PSM System documentation have been made, if required.

RATING 3: PSM System is deployed.

Procedures for PSM System tasks, based on risk, are documented

Ongoing monitoring and measurement.

RATING 2: The five Characteristics of PSM System are documented, approved, and resourced. (Scope, Resource, Time, Quality, Risk)

Procedures for key PSM System tasks, based on risk, have been identified and are under development

Deployment is underway.

RATING 1: A documented PSM System is being developed to address the potential hazards of the operation and to improve performance.

Figure 6: Criteria for Completeness Assessment
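The four completeness ratings could be represented as an ordered scale, as in the sketch below. The member names and the three yes/no questions are assumptions made for illustration; they simplify the criteria considerably and do not reproduce the questionnaire used by the RCFA Inductive Process software.

```python
from enum import IntEnum


class Completeness(IntEnum):
    """Completeness ratings, paraphrased from the criteria in Fig. 6."""
    BEING_DEVELOPED = 1  # a documented PSM System is being developed
    DOCUMENTED = 2       # documented, approved, resourced; deployment underway
    DEPLOYED = 3         # deployed, procedures documented, ongoing monitoring
    MATURE = 4           # has undergone feedback, review and improvement


def rate_completeness(documented, deployed, reviewed):
    """Map three yes/no answers to a rating (a simplification of the criteria)."""
    if not documented:
        return Completeness.BEING_DEVELOPED
    if not deployed:
        return Completeness.DOCUMENTED
    if not reviewed:
        return Completeness.DEPLOYED
    return Completeness.MATURE


rating = rate_completeness(documented=True, deployed=False, reviewed=False)
print(rating.name, int(rating))
```

Using an integer-valued scale keeps the ratings comparable, so a higher number always means a more complete PSM element.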

PSM Elements Effectiveness: Pump & Electrical Motor Failure

Fig. 7 outlines the effectiveness scoring:

Figure 7: Criteria for Effectiveness Assessment

PSM Elements Degradation: Pump & Electrical Motor Failure

The RCFA Inductive Process uses software with a built-in methodology to assess the completeness and effectiveness of PSM elements for each step of the Sequence of Events. The following is an example of the assessment for The Pump Failures at Starhill Platform, Sequence of Events, Event 02, Period of Frequent Failures:

PSM Mechanical Integrity Assessment: Pump and Electrical Motor Failure

The following RCFA Inductive Process questionnaire examines the content of the knowledge base that should be available to the operator, and where it lacks completeness and effectiveness. The assessment of Sequence No. 2 relative to PSM Mechanical Integrity is illustrated below:

Evaluate “Mechanical Integrity”

Figure 8: PSM Mechanical Integrity Assessment

According to the assessment table, Fig. 8, there is no defect (failure) control at Starhill platform. The operators and mechanics were not provided with baseline mechanical integrity data, and data on past repairs was not available.

The Investigation Team gave a very low rating to operators' awareness of the equipment database, equipment limitations, the PM program and equipment maintenance data. The equipment maintenance database can include the Equipment Maintenance and Reliability Strategy, Reliability Targets, Operating Limits, Material of Construction, specification sheets, time consumed, impact, and costing. The schedule and the Annual Preventive Maintenance calendar need to be available and displayed for operators, together with work instructions. Values measured with calibrated instruments should be made available for comparison with the Design Specification. Agreements on deviations from schedules, and any resulting impact, should be documented and provided to operators.

The Investigation Team gave a low rating to operator involvement in the PM and operational database, understanding of Mean Time Between Failure information for increasing on-time service hours, and knowledge of the basis for Predictive Maintenance, Reliability Engineering data for equipment and components, Plant Life Extension studies, Remnant Life Analysis and Sparing philosophy.

The Investigation Team gave a low rating to the testing and inspection programme covering Engineered Controls, Vents, Drains, Flare Lines, Elbows, Bends, Turbulent Spots, Spring hangers, Pipe supports, Foundation Settlements, Stack Top Segments, Critical Isolation Valves and Non-Return Valves.

The Investigation Team gave a low rating to the inspection techniques used to ensure integrity, reference to appropriate Inspection Standards, individual competency, and instrument and result interpretation.

PSM Training Assessment: Pump & Electrical Motor Failure

The PSM Training questionnaire, Fig. 9, focuses on the competence of operating personnel, their training, and the experience gained through hands-on work. Assessment of training is essential to understanding the causes leading to failure in Sequence No. 2, The Pump Failures at Starhill Platform.

Evaluate “Training”

Figure 9: PSM-Training

The Investigation Team gave a low rating to Training: it has no clear goals and objectives and is not based on task analysis, skills gap analysis or personnel evaluation criteria, nor does it have a structured calendar, participant profile, venue selection for ambience, trainer choice, prescribed duration, or subject topics relevant to the participants. Employees at Starhill platform were overwhelmed by the magnitude of changes and issues. The constraint at Starhill Platform is employees' understanding of the processing technology and of the systems in which facility design, operation and maintenance are embedded.

The Investigation Team gave a low rating to Levels of Training as published for both budgetary purposes and progress evaluation. Training should be treated as an ongoing activity, preferably with a specified budget and an absentee list. A characteristic feature of the Starhill Platform is that its complexity exceeds employees' grasp of how the facilities are designed and how they should be operated and maintained. In complex, automated systems such as Starhill Platform, management system accidents are predominant.

Completeness and Effectiveness of PSM Elements: Pump & Electrical Motor Failure

The Investigation Team rated the degraded PSM elements relevant to the failure analysis of the Starhill Platform Crude Transfer Pump.

The RCFA Inductive Process proprietary software and methodology summarizes, Fig. 10, the degradation assessment of each PSM element as related to Step 02 in the Sequence of Events.

The rating of degraded PSM elements is used to prioritize elements where a departure from standard procedures or specifications in PSM results in non-conforming processes, or where there have been unusual or unexplained events with the potential to affect production, system integrity or personal safety.

The Investigation Team rating for PSM degradation is summarized below:

“PSM Evaluation Summary Results”

Figure 10: PSM Evaluation Summary Score

The RCFA Inductive Process root cause failure investigation provides software and a methodology to evaluate the deviation and its consequence and to assess the potential impact on Health & Safety, Economics and the Environment.

If a critical or serious deviation is uncovered, corrective and preventive actions can be determined, and follow-up tasks should be assigned.

Risk Assessment due to Degraded PSM System: Pump & Electrical Motor Failure

The Investigation Team ratings for PSM degradation relative to the events in the incident investigation, as summarized in Fig. 10, are evaluated for the risk due to the degraded PSM system.

The Investigation Team assessed the risk due to degraded PSM, Fig. 11, by analyzing the amount of variance between the completeness and effectiveness of the analyzed steps in the Sequence of Events. The Investigation Team's rating is:

The risk associated with non-compliance with the PSM element for Event 02 of the Sequence of Events is CRITICAL: High Variance and High Consequence.

The risk associated with that hazard needs to be reduced by minimizing the variance to PSM. The Investigation Team is to determine appropriate ways to eliminate or control the hazard, i.e. the variance to PSM.

Further consideration is needed to ascertain that deviation from PSM is a common cause of accidents; by managing the PSM system we are managing potential incidents.

Figure 11: RCFA Inductive Process Risk Metrics
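The classification of risk from variance and consequence can be pictured as a small lookup matrix, as in the sketch below. Only the combination of high variance and high consequence yielding CRITICAL is taken from the text; the other three cells, the two-level scale and the function name are placeholders chosen for illustration. How the proprietary software derives the variance level from the completeness and effectiveness ratings is not described in the text and is not reproduced here.

```python
def risk_category(variance, consequence):
    """Classify risk from variance and consequence levels ("low" or "high").

    Only the high/high -> CRITICAL cell comes from the text; the remaining
    cells are illustrative placeholders.
    """
    matrix = {
        ("high", "high"): "CRITICAL",
        ("high", "low"): "SERIOUS",
        ("low", "high"): "SERIOUS",
        ("low", "low"): "MINOR",
    }
    return matrix[(variance, consequence)]


# Event 02 of the Sequence of Events: High Variance and High Consequence.
print(risk_category("high", "high"))
```

A real risk matrix would normally use more than two levels per axis; the two-level form is kept here only to match the single rating quoted for Event 02.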

Defining Improvements to Reduce Risk: Pump & Electrical Motor Failure

The RCFA Inductive Process should define improvements, Fig. 12, and prioritize the risks associated with PSM non-compliance, followed by the application of resources to minimize, monitor and control the probability and/or impact of failures and incidents, or to maximize the realization of opportunities.

Figure 12: RCFA Inductive Process Root Cause and Improvements Action

The Investigation Team provided recommendations dealing with the fundamentals of PSM: employee involvement, hazard analysis, operating procedures, training and mechanical integrity. The recommendations define improvements, and the options for managing risk due to PSM degradation include transferring the risk to another party, avoiding the risk, reducing the negative effect of the risk, and accepting some or all of the consequences of a particular risk.

The Investigation Team's recommendations are prioritized so that the risks with the greatest loss and the greatest probability of occurring are handled first, and risks with lower probability of occurrence and lower loss are handled in descending order.
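The prioritization rule described above can be sketched as a sort. The sketch assumes that probability and loss are combined by multiplying them into an expected loss, which is one common way to rank risks but is not stated in the text. The risk names, probabilities and loss figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # estimated likelihood of occurrence, 0 to 1
    loss_usd: float     # estimated loss if the risk materializes


def prioritize(risks):
    """Order risks so that the greatest expected loss is handled first."""
    return sorted(risks, key=lambda r: r.probability * r.loss_usd, reverse=True)


# Hypothetical values for illustration only.
risks = [
    Risk("Training gaps", 0.3, 200_000),
    Risk("No defect control", 0.6, 600_000),
    Risk("Outdated operating procedures", 0.5, 100_000),
]

ranked = prioritize(risks)
for r in ranked:
    print(r.name)
```

With these placeholder figures, the lack of defect control is handled first, followed by the training gaps and then the operating procedures.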

Implementing Improvements: Pump & Electrical Motor Failure

The Investigation Team recommendations incorporate fundamental culture change to avoid resistance:

A fundamental revisit of PSM implementation to offset managers' resistance to changing existing structures.

A fundamental revisit of PSM implementation to offset labor force resistance due to fears of additional work, job security and layoffs.

Way Forward: Pump & Electrical Motor Failure

An investigation team will uncover what it is looking for. If there is a common belief that accidents take place due to faults in the manufacturing management system, the team will probably focus on the deteriorated PSM system.

However, if the shared belief is that consensus is preferable to facts, then brainstorming is the way to go.

Safety and effectiveness vary significantly across the industry. Comparable plants can spend one third as much on maintenance and still achieve pacesetter safety, availability, reliability and product quality.

The Oil & Gas industry needs an appreciation of the benchmark, that is, of how stellar performance is accomplished by the best in the industry. For Oil & Gas processing plants with poor safety and operating effectiveness, plant output does not correlate with the resources used: the resources are consumed, but performance stubbornly falls behind the potential achieved by pacesetters. Poorly performing Oil & Gas plants feature operating and maintenance teams not equipped to solve safety and reliability problems, poor availability, equipment operated beyond its design limitations, poor repair, safety and maintenance effectiveness, ineffective contractor policy, and large and ineffective spare material inventories. The way forward for the Oil & Gas industry could be to promote a benchmark model for PSM effectiveness and to use it as a guide for PSM assessment and incident investigation.

List of References:

[1] FMRIMS@Root ™, http://www.norcanreliabilityengineering.com

[2] Swain, A.D. & Guttmann, H.E., Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications. 1983, NUREG/CR-1278, USNRC.

[3] Humphreys, P. (1995). Human Reliability Assessor’s Guide. Human Factors in Reliability Group.

[4] "The Human Factors Analysis and Classification System (HFACS)," Approach, July - August 2004. Accessed July 12, 2007.

[5] Reason, J. (1990). Human Error. Cambridge University Press.

[6] Standards Association of Australia (1999). Risk management. North Sydney, N.S.W: Standards Association of Australia. ISBN 0-7337-2647-X.

[7] United States Environmental Protection Agency (April 2004). General Risk Management Program Guidance. United States Environmental Protection Agency. http://www.epa.gov

[8] OSHA PSM 1910.119; Process safety management of highly hazardous chemicals http://www.osha.gov  
