ronald-bartels-
Root Cause Analysis
Like icebergs, most of the problem is usually below the surface!
Investigating causes of failures & mishaps
Stop and ask yourself…
Did you really find the causes of the failure?
This is NOT Root Cause Analysis
Technical Proficiency
• Once the accident happened, how did Gene Kranz rely on the skills and expertise of his people?
• How did Lovell work to initiate actions in the spaceship? Was he able to balance that with his technical responsibilities in the craft? How did he do it?
• What steps does your unit take to maintain Technical Proficiency?
Lessons from Apollo 13
Teambuilding
• How did Lovell contribute to the group process when Mattingly wanted to practice the docking procedure again after 3 hrs of practice?
• When Kranz had the team in the classroom, how did he establish the goal, and how did he then motivate others to achieve the goal of returning the spacecraft safely to Earth?
• Did Lovell make the right call when faced with the challenge of forcing Mattingly to stay behind because of the fear of measles?
• How does a leader successfully build a strong team, but then separate him or herself from the Team to make a critical decision?
• How’s your team doing?
Effective Communications
• Even as everything is breaking loose in Mission Control, Gene Kranz asks his team to “Work the Problem.” He then listens to the experts report in on their areas of the mission. How did his effective communication set the stage for a successful recovery?
• Kranz stated “Failure is not an option” and Lovell told his crew “I intend to go home.” By clearly stating their ideas and vision, how did they direct the teams towards mission accomplishment?
• Who’s the best communicator you’ve ever worked with? What made them excel?
Vision Development & Implementation
• JFK’s Vision: “I believe that this nation should commit itself to achieving the goal, before this decade is out, of landing a man on the moon and returning him safely to Earth.”
• How does a stated vision focus the unit and bring the crew together?
• Lovell states: “Columbus, Lindbergh, and Armstrong; it is not a miracle for man to walk on the moon, we just decided to go.”
• What’s the vision at your unit? Has everyone decided “to go”? What can your unit do to get everyone “on board”?
Conflict Management
• How did Lovell deal with stress and conflict in the LEM?
• How did the CO2 challenge help the crew to overcome the conflict they were experiencing?
• Is there more or less conflict when people are busy and focused or when there is less to do and folks have time on their hands? Why?
• How did Kranz and Lovell go about alleviating conflict between the crew and the medical team?
Decision Making & Problem Solving
• How did the Team live the Competency of Decision Making and Problem Solving in working the “Power” problem to conclusion?
• Right after the explosion, Kranz asks Mission Control, “What do we have on the spacecraft that’s good?”
• Why did he ask this question?
• How did it aid in making the correct decision to shut down the fuel cells?
• Does everyone on your team ensure that the decision makers have all the available and correct information? Why or why not?
Creativity and Innovation
• We’ve discussed a lot of positive leadership qualities during this session. How did Gene Kranz create an environment in which his Mission Control team could figure out how to solve the CO2 problem with a “Square Peg in a Round Hole”?
• Lovell states at the end of the movie: “Thousands of people worked to bring the 3 of us back home.” How did creativity and innovation make the “Successful Failure” a reality?
• How does your unit build on Lessons Learned?
Investigating causes of failures & mishaps
When performing an investigation, it is necessary to look at more than just the immediately visible cause, which is often the proximate cause.
There are also underlying organizational causes that are more difficult to see. They may contribute significantly to the undesired outcome and, if not corrected, they will continue to create similar types of problems. These are root causes.
Requirements for mishap reporting and investigation: all investigations must identify the proximate cause(s), root cause(s), and contributing factor(s).
Definitions
Proximate Cause(s) (Direct Cause)
• The event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, directly resulted in its occurrence and, if eliminated or modified, would have prevented the undesired outcome.
• Examples of proximate causes:
  • Equipment: arced, leaked, overloaded, overheated
  • Human: pushed incorrect button, fell, dropped tool, connected wires
Root Cause(s)
• One of multiple factors (events, conditions or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented the undesired outcome. Typically multiple root causes contribute to an undesired outcome.
Organizational factors
• Any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal.
• Examples: resource management (budget, staff, training); policy (content, implementation, verification); and management decisions.
Root Cause Analysis (RCA)
• A structured evaluation method that identifies the root causes for an undesired outcome and the actions adequate to prevent recurrence. Root cause analysis should continue until organizational factors have been identified, or until data are exhausted.
• RCA is a method that helps professionals determine:
  • What happened.
  • How it happened.
  • Why it happened.
• Allows learning from past problems, failures, and accidents.
Root Cause Analysis - Steps
1. Identify and clearly define the undesired outcome (outage).
2. Gather data.
3. Create a timeline.
4. Place events & conditions on an event and causal factor tree.
5. Use a fault tree or other method/tool to identify all potential causes.
6. Decompose system failures down to basic events or conditions (further describe what happened).
7. Identify specific failure modes (immediate causes).
8. Continue asking “WHY” to identify root causes.
9. Check your logic and your facts. Eliminate items that are not causes or contributing factors.
10. Generate solutions that address both proximate causes and root causes.
Clearly define the undesirable outcome.
• Describe the undesired outcome.
• For example: “software failed to deploy,” “transaction failed,” or “XYZ project schedule significantly slipped.”
Gather data.
Identify facts surrounding the undesired outcome.
• When did the undesired outcome occur?
• Where did it occur?
• What conditions were present prior to its occurrence?
• What controls or barriers could have prevented its occurrence but did not?
• What are all the potential causes?
• What actions can prevent recurrence?
• What amelioration occurred? Did it prevent further damage?
Create a timeline (sequence diagram)
• Illustrate the sequence of events in chronological order horizontally across the page.
• Depict relationships between conditions, events, and exceeded or failed barriers/controls.
[Figure: timeline template with boxes for events, conditions, exceeded/failed barriers or controls, and the undesired outcome.]
Create a timeline (sequence diagram) continued…
• If amelioration occurred (e.g., reboot server, move application to another server), this should be included in the evaluation to ensure that it did not contribute to the undesired outcome.
Example: In the case of a server reboot, the investigation should ensure that the reboot was a result of the mishap and not a result of latent hardware defects.
[Figure: the timeline template with added boxes for exceeded/failed amelioration.]
Example: simple timeline.
[Figure: simple timeline with boxes labeled “Server powered up,” “Operating system started up,” “Switch port in wrong VLAN,” “Technician used wrong method to correct,” “Application failed to go live,” and “Lost transactions (penalties paid).”]
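A timeline like this can also be kept as data and sorted, which helps when facts arrive out of order during data gathering. The sketch below is only an illustration: the box labels are from the slide, but every timestamp, the resulting order, and the names `TimelineEntry`, `build_timeline`, and `render` are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime
    kind: str          # "event", "condition", or "outcome"
    description: str


def build_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda entry: entry.when)


def render(timeline):
    """Lay the timeline out left to right, one box per entry."""
    return " -> ".join(f"[{e.kind}] {e.description}" for e in timeline)


# Box labels come from the slide; every timestamp below is invented,
# so the resulting order is an assumption, not a finding.
ENTRIES = [
    TimelineEntry(datetime(2024, 1, 1, 9, 30), "outcome", "Lost transactions (penalties paid)"),
    TimelineEntry(datetime(2024, 1, 1, 9, 0), "event", "Server powered up"),
    TimelineEntry(datetime(2024, 1, 1, 9, 20), "event", "Technician used wrong method to correct"),
    TimelineEntry(datetime(2024, 1, 1, 9, 5), "event", "Operating system started up"),
    TimelineEntry(datetime(2024, 1, 1, 9, 10), "condition", "Switch port in wrong VLAN"),
    TimelineEntry(datetime(2024, 1, 1, 9, 15), "event", "Application failed to go live"),
]

timeline = build_timeline(ENTRIES)
print(render(timeline))
```

Entries are recorded in whatever order they are discovered; sorting by timestamp produces the left-to-right sequence the slide asks for.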
Create an event and causal factor tree.
(A visual representation of the causes that led to the failure or mishap.)
• Place the undesired outcome at the top of the tree.
• Add all events, conditions, and exceeded/failed barriers that occurred immediately before the undesired outcome and might have caused it.
[Figure: event and causal factor tree built from the timeline boxes: “Lost transactions (penalties paid),” “Application failed to go live,” “Technician used wrong method to correct,” “Switch port in wrong VLAN,” “Operating system started up,” and “Server powered up.”]
Create an event and causal factor tree continued…
• Brainstorm to ensure that all possible causes are included, NOT just those that you are sure are involved.
• Be sure to consider people, hardware, software, policy, procedures, and the environment.
[Figure: the tree with brainstormed candidate causes added: “Electric power tripped,” “Power supply failed,” “Technicians not properly trained,” “Port labeled incorrectly,” “Switch labeled incorrectly,” and “NIC driver wrong.”]
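An event and causal factor tree is an ordinary tree structure, so it can be sketched in a few lines. In this illustration the labels are from the slides, while the parent/child links, the `kind` tags, and the names `Node`, `render`, and `leaves` are assumptions, since the original diagram layout is not recoverable.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    kind: str = "event"   # "outcome", "event", "condition", or "barrier"
    children: list = field(default_factory=list)

    def add(self, label, kind="event"):
        """Attach a possible cause beneath this node and return it."""
        child = Node(label, kind)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented lines, undesired outcome first."""
    lines = ["  " * depth + f"{node.label} [{node.kind}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def leaves(node):
    """Candidate causes that have not been decomposed any further."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in leaves(child)]


# Labels are from the slides; which box hangs under which is assumed.
outcome = Node("Lost transactions (penalties paid)", "outcome")
go_live = outcome.add("Application failed to go live")
outcome.add("Technician used wrong method to correct").add(
    "Technicians not properly trained", "condition")
vlan = go_live.add("Switch port in wrong VLAN", "condition")
vlan.add("Port labeled incorrectly", "condition")
vlan.add("Switch labeled incorrectly", "condition")
go_live.add("NIC driver wrong", "condition")
power = go_live.add("Electric power tripped")
power.add("Power supply failed")

print("\n".join(render(outcome)))
```

The `leaves` helper lists the boxes that still need a “why” asked of them, which is where the next round of brainstorming would start.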
Create an event and causal factor tree continued…
• If you have solid data indicating that one of the possible causes is not applicable, it can be eliminated from the tree.
Caution: Do not be too eager to eliminate early on. If there is a possibility that it is a causal factor, leave it and eliminate it later when more information is available.
[Figure: the same tree with one candidate cause crossed out (X).]
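One way to enforce the caution above is to refuse an elimination unless the supporting data is written down with it. The sketch below assumes a flat list of the brainstormed candidates; the `Candidate` class, the `eliminate` and `still_open` helpers, and the evidence text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    status: str = "open"      # "open" or "eliminated"
    evidence: str = ""


def eliminate(candidate, evidence):
    """Cross a candidate off only when solid data is recorded against it."""
    if not evidence.strip():
        raise ValueError(f"no data recorded for eliminating {candidate.label!r}")
    candidate.status = "eliminated"
    candidate.evidence = evidence


def still_open(candidates):
    """Labels of the candidates that remain on the tree."""
    return [c.label for c in candidates if c.status == "open"]


candidates = [Candidate(label) for label in (
    "Electric power tripped",
    "Power supply failed",
    "Technicians not properly trained",
    "Port labeled incorrectly",
    "Switch labeled incorrectly",
    "NIC driver wrong",
)]

# Hypothetical finding, invented for illustration only.
eliminate(candidates[0], "UPS log shows no power interruption during the window")

try:
    eliminate(candidates[5], "")   # a hunch is not data, so this stays open
except ValueError as err:
    print(err)

print(still_open(candidates))
```

Keeping the evidence next to the crossed-out item also makes it easy to reinstate a candidate if later information contradicts the original finding.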
Create an event and causal factor tree continued…
• You may use a fault tree to determine all potential causes and to decompose the failure down to the “basic event” (e.g., system component level).
[Figure: the tree decomposed further, adding the basic events “Diagram wrong,” “Maintenance swap with no re-label,” and “Confusing labels.”]
Create an event and causal factor tree continued…
• A fault tree can also be used to identify all possible types of human failures.
[Figure: the tree with a human-failure branch showing the error types below.]
• Perception error: didn’t perceive system feedback
• Interpretation error: didn’t understand system feedback
• Decision-making error: correct interpretation, incorrect decision
• Action-execution error: correct decision but incorrect action
• Further error types shown: rule-based error, knowledge-based error, skill-based error
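The four human-failure types on this slide form a sequence: a person must perceive the feedback, understand it, decide correctly, and then act correctly. A minimal sketch of that reading follows; the function name, its boolean parameters, and the "first failed stage wins" rule are assumptions made for illustration.

```python
from enum import Enum


class HumanError(Enum):
    PERCEPTION = "Perception error"
    INTERPRETATION = "Interpretation error"
    DECISION_MAKING = "Decision-making error"
    ACTION_EXECUTION = "Action-execution error"


def classify_human_error(perceived, understood, decided_correctly, acted_correctly):
    """Return the first stage at which the person went wrong, or None."""
    if not perceived:
        return HumanError.PERCEPTION          # didn't perceive system feedback
    if not understood:
        return HumanError.INTERPRETATION      # didn't understand system feedback
    if not decided_correctly:
        return HumanError.DECISION_MAKING     # correct interpretation, incorrect decision
    if not acted_correctly:
        return HumanError.ACTION_EXECUTION    # correct decision but incorrect action
    return None


# Example: feedback was seen and understood, but the wrong fix was chosen.
result = classify_human_error(True, True, False, True)
print(result.value)
```

Classifying the error this way points the next “why” at the right place, for example at training for a decision-making error rather than at monitoring for a perception error.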
Create an event and causal factor tree continued…
• After you have identified all the possible causes, ask yourself “WHY” each may have occurred.
• Be sure to keep your questions focused on the original issue. For example: “Why was the condition present?”; “Why did the event occur?”; “Why was the parameter exceeded?”; or “Why did the condition fail?”
[Figure: generic tree. Beneath the undesired outcome sit Event #1, Event #2, a condition, and a failed or exceeded barrier or control; each has “WHY” boxes beneath it (why the event occurred, why the condition existed or changed, why the barrier or control failed or was exceeded).]
Continue to ask “why” until you have reached:
1. Root cause(s) - including all organizational factors that exert control over the design, fabrication, development, maintenance, operation, and disposal of the system.
2. A problem that is not correctable by IT or an IT contractor.
3. Insufficient data to continue.
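The three stopping conditions can be read as the exit tests of a loop. The sketch below walks a single chain of “why” answers and reports which condition ended it. The labels are taken from the deck's worked example, but the order in which they are linked, the choice of which label counts as organizational, and the names `Stop` and `ask_why` are assumptions.

```python
from enum import Enum


class Stop(Enum):
    ROOT_CAUSE = "organizational factor reached"
    NOT_CORRECTABLE = "not correctable by IT or an IT contractor"
    NO_DATA = "insufficient data to continue"


def ask_why(start, answers, organizational, not_correctable=frozenset()):
    """Follow 'why' answers from start until a stopping condition is met.

    answers maps each item to the answer to "why did this happen?".
    Returns (chain of items, reason for stopping).
    """
    chain = [start]
    current = start
    while True:
        if current in organizational:
            return chain, Stop.ROOT_CAUSE
        if current in not_correctable:
            return chain, Stop.NOT_CORRECTABLE
        if current not in answers:
            return chain, Stop.NO_DATA
        current = answers[current]
        if current in chain:
            raise ValueError("circular chain of answers")
        chain.append(current)


# Links between these labels are assumed for illustration.
answers = {
    "Technician used wrong method to correct": "Insufficient anomaly training",
    "Insufficient anomaly training": "Training does not exist",
    "Training does not exist": "Insufficient training budget",
    "Insufficient training budget":
        "Organization underestimates importance of anomaly training",
}
# Which items count as organizational factors is the investigator's call.
organizational = {"Organization underestimates importance of anomaly training"}

chain, reason = ask_why("Technician used wrong method to correct", answers, organizational)
print(" <- ".join(chain))
print(reason.value)
```

A real tree branches, so each branch would be walked this way; a branch that ends in `NO_DATA` marks where more data gathering is needed.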
The resultant tree of questions and answers should lead to a comprehensive picture of POTENTIAL causes for the undesired outcome.
[Figure: the generic tree expanded with many further levels of “WHY” boxes.]
Check your logic with a detailed review of each potential cause.
• Verify it is a contributor or cause.
• If the action, deficiency, or decision in question were corrected, eliminated or avoided, would the undesired outcome be prevented or avoided?
> If no, then eliminate it from the tree.
[Figure: the expanded tree with many “WHY” boxes crossed out (X).]
Create an event and causal factor tree continued…
• The remaining items on the tree are the causes (or probable causes) necessary to produce the undesired outcome.
• Proximate causes are those immediately before the undesired outcome.
• Intermediate causes are those between the proximate and root causes.
• Root causes are organizational factors or systemic problems located at the bottom of the tree.
[Figure: the pruned tree divided into three bands, from top to bottom: PROXIMATE CAUSES, INTERMEDIATE CAUSES, ROOT CAUSES.]
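The three bands follow from position in the tree, so they can be assigned mechanically once the tree has been pruned. The sketch below treats the items directly beneath the outcome as proximate, items with nothing beneath them as root, and everything between as intermediate. Treating every bottom item as a root cause is a simplifying assumption, as are the `classify` function and the placeholder labels.

```python
def classify(tree, outcome):
    """Label every cause beneath the outcome by its position in the tree.

    tree maps each item to the list of causes directly beneath it.
    """
    levels = {}

    def visit(node, depth):
        for cause in tree.get(node, []):
            if depth == 0:
                levels[cause] = "proximate"       # immediately before the outcome
            elif not tree.get(cause):
                levels[cause] = "root"            # bottom of the tree
            else:
                levels[cause] = "intermediate"    # between proximate and root
            visit(cause, depth + 1)

    visit(outcome, 0)
    return levels


# Placeholder labels in the style of the generic tree; the shape is invented.
tree = {
    "Undesired outcome": ["Event #1", "Event #2"],
    "Event #1": ["Why event #1 occurred"],
    "Why event #1 occurred": ["Organizational factor A"],
    "Event #2": ["Organizational factor B"],
}

levels = classify(tree, "Undesired outcome")
for cause, level in levels.items():
    print(f"{level:12} {cause}")
```

A bottom item that is not an organizational factor or systemic problem usually means the “why” questioning stopped early on that branch.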
Contributing Factors
Some people choose to leave contributing factors on the tree to show all factors that influenced the event.
Contributing factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.
If this is done, illustrate them differently (e.g., dotted line boxes and arrows) so that it is clear that they are not causes.
[Figure: the pruned tree with contributing factors added.]
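The test that separates a cause from a contributing factor is a counterfactual: remove the item and ask whether the outcome would still have occurred. The sketch below applies that test against a toy model of the outcome. The model, the function names, and the bracket notation standing in for solid and dotted boxes are all invented for illustration. The test cannot tell a contributing factor from an irrelevant item, so that decision remains with the investigator.

```python
def classify_factors(factors, outcome_occurs):
    """Apply the "if eliminated, would it have prevented the outcome?" test.

    factors: set of labels that were present.
    outcome_occurs: function taking a set of present labels, returning bool.
    """
    if not outcome_occurs(factors):
        raise ValueError("the model does not reproduce the undesired outcome")
    roles = {}
    for factor in sorted(factors):
        without = factors - {factor}
        if outcome_occurs(without):
            roles[factor] = "not a cause on its own"
        else:
            roles[factor] = "cause"
    return roles


def render(label, role):
    """Solid brackets for causes, dotted ones for items kept as contributing factors."""
    return f"[ {label} ]" if role == "cause" else f"[: {label} :]"


# Toy model, invented for illustration: the outcome needs both conditions.
def outcome_occurs(present):
    return ("Switch port in wrong VLAN" in present
            and "Technician used wrong method to correct" in present)


present = {
    "Switch port in wrong VLAN",
    "Technician used wrong method to correct",
    "Confusing labels",
}
roles = classify_factors(present, outcome_occurs)
for label, role in roles.items():
    print(render(label, role))
```

Removing one item at a time is the simplest form of the test; when two items could each have produced the outcome alone, they need to be examined together.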
Investigating Causes of Failures & Mishaps
Root cause is much deeper: keep asking why.
[Figure: the example tree (“Lost transactions (penalties paid),” “Application failed to go live,” “Technician used wrong method to correct,” “Switch port in wrong VLAN,” “Operating system started up”) extended with further answers: “No IP connection to network,” “VLAN incorrectly assigned,” “VLAN changed in unrelated move,” “Engineer did not read correct label,” and “Incorrect server static address used.”]
[Figure: the example tree extended down to organizational factors: “Decision-making error (correct interpretation, incorrect decision),” “No quality inspection,” “Insufficient quality staff,” “Insufficient budget,” “Procedure incorrect,” “Not updated,” “Not under configuration management,” “New task,” “Insufficient anomaly training,” “Training does not exist,” “Insufficient training budget,” and “Organization underestimates importance of anomaly training.”]
Generating Recommendations:
At a minimum, corrective actions should be generated to eliminate proximate causes and eliminate or mitigate the negative effects of root causes.
When multiple causes exist, there is limited budget, or it is difficult to determine what should be corrected:
• Quantitative analysis can be used to determine the total contribution of each cause to the undesirable outcome.
• Fishbone diagrams (or other methods) can be used to arrange causes in order of their importance.
• Those causes which contribute most to the undesirable outcome should be eliminated or the negative effects should be mitigated to minimize risk.
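One simple form of quantitative analysis is a Pareto-style ranking: score each cause, sort by score, and address the top causes until a chosen share of the impact is covered. This is a swapped-in technique, not one the slides name, and the scores, the 80% threshold, and the function names below are all invented for illustration.

```python
def rank_causes(contributions):
    """Return (cause, share, cumulative share) rows, largest contributor first."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("contributions must sum to a positive number")
    rows, running = [], 0
    ordered = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    for cause, value in ordered:
        running += value
        rows.append((cause, value / total, running / total))
    return rows


def select_until(rows, threshold):
    """Causes to address first: the fewest top rows that reach the threshold."""
    chosen = []
    for cause, _share, cumulative in rows:
        chosen.append(cause)
        if cumulative >= threshold:
            break
    return chosen


# Invented scores, e.g. lost transactions attributed to each cause.
scores = {
    "No quality inspection": 20,
    "Insufficient anomaly training": 50,
    "Procedure incorrect": 30,
}

rows = rank_causes(scores)
for cause, share, cumulative in rows:
    print(f"{cause:32} {share:5.0%} {cumulative:5.0%}")
print(select_until(rows, 0.8))
```

How contribution is scored (cost, downtime, frequency) is a judgment the investigation team has to make and record; the ranking is only as good as those inputs.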
Definitions of RCA & Related Terms
Cause (Causal Factor): An event or condition that results in an effect. Anything that shapes or influences the outcome.
Proximate Cause(s): The event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, directly resulted in its occurrence and, if eliminated or modified, would have prevented the undesired outcome. Also known as the direct cause(s).
Root Cause(s): One of multiple factors (events, conditions or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented the undesired outcome. Typically multiple root causes contribute to an undesired outcome.
Root Cause Analysis (RCA): A structured evaluation method that identifies the root causes for an undesired outcome and the actions adequate to prevent recurrence. Root cause analysis should continue until organizational factors have been identified, or until data are exhausted.
Event: A real-time occurrence describing one discrete action, typically an error, failure, or malfunction. Examples: pipe broke, power lost, lightning struck, person opened valve, etc.
Condition: Any as-found state, whether or not resulting from an event, that may have safety, health, quality, security, operational, or environmental implications.
Organizational Factors: Any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal. Examples: resource management (budget, staff, training); policy (content, implementation, verification); and management decisions.
Contributing Factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.
Barrier: A physical device or an administrative control used to reduce risk of the undesired outcome to an acceptable level. Barriers can provide physical intervention (e.g., a guardrail) or procedural separation in time and space (e.g., lock-out-tag-out procedure).
MIR Process / Forms
Major Incident – Severe Business impact:
• service, system or infrastructure component not functioning adequately to enable business process
• total loss of service, system or infrastructure component
Major Incidents can also be considered to be those which do not entirely impede the use of the service, system or infrastructure component, such as:
• continuous slow response
• general degradation of service
• Refer: http://thinkingproblemmanagement.blogspot.com