Root Cause Analysis

Page 1: Root cause analysis

Root Cause Analysis

Page 2: Root cause analysis

Like icebergs, most of the problem is usually below the surface!

Investigating causes of failures & mishaps

Stop and ask yourself…

Did you really find the causes of the failure?

Page 3: Root cause analysis

This is NOT Root Cause Analysis

Page 4: Root cause analysis

Technical Proficiency

• Once the accident happened, how did Gene Kranz rely on the skills and expertise of his people?

• How did Lovell work to initiate actions in the spaceship? Was he able to balance that with his technical responsibilities in the craft? How did he do it?

• What steps does your unit take to maintain Technical Proficiency?

Lessons from Apollo 13

Page 5: Root cause analysis

Teambuilding

• How did Lovell contribute to the group process when Mattingly wanted to practice the docking procedure again after 3 hrs of practice?

• When Kranz had the team in the classroom, how did he establish the goal, and how did he go about motivating others to achieve the goal of returning the spacecraft safely to Earth?

• Did Lovell make the right call when faced with the challenge of forcing Mattingly to stay behind because of the fear of measles?

• How does a leader successfully build a strong team, but then separate him or herself from the Team to make a critical decision?

• How’s your Team doing?

Page 6: Root cause analysis

Effective Communications

• Even as everything is breaking loose in Mission Control, Gene Kranz asks his team to “Work the Problem.” He then listens to the experts report in on their areas of the mission. How did his effective comms set the stage for a successful recovery?

• Kranz stated “Failure is not an option” and Lovell told his crew “I intend to go home.” By clearly stating their ideas and vision, how did they direct the teams towards mission accomplishment?

• Who’s the best communicator you’ve ever worked with? What made them excel?

Page 7: Root cause analysis

Vision Development & Implementation

• JFK’s Vision: “I believe that this nation should commit itself to achieving the goal, before this decade is out, of landing a man on the moon and returning him safely to Earth.”

• How does a stated vision focus the unit and bring the crew together?

• Lovell states: “Columbus, Lindbergh, and Armstrong; it is not a miracle for man to walk on the moon, we just decided to go.”

• What’s the vision at your unit? Has everyone decided “to go”? What can your unit do to get everyone “on board”?

Page 8: Root cause analysis

Conflict Management

• How did Lovell deal with stress and conflict in the LEM?

• How did the CO2 challenge help the crew to overcome the conflict they were experiencing?

• Is there more or less conflict when people are busy and focused or when there is less to do and folks have time on their hands? Why?

• How did Kranz and Lovell go about alleviating conflict between the crew and the Medical team?

Page 9: Root cause analysis

Decision Making & Problem Solving

• How did the Team live the Competency of Decision Making and Problem Solving in working the “Power” problem to conclusion?

• Right after the explosion, Kranz asks Mission Control, “What do we have on the spacecraft that’s good?”

• Why did he ask this question?

• How did it aid in making the correct decision to shut down the fuel cells?

• Does everyone on your team ensure that the decision makers have all the available and correct information? Why or why not?

Page 10: Root cause analysis

Creativity and Innovation

• We’ve discussed a lot of positive leadership qualities during this session. How did Gene Kranz create an environment with his Mission Control team to ensure they were able to figure out how to solve the CO2 problem with a “Square Peg in a Round Hole”?

• Lovell states at the end of the movie: “Thousands of people worked to bring the 3 of us back home.” How did creativity and innovation make the “Successful Failure” a reality?

• How does your unit build on Lessons Learned?

Page 11: Root cause analysis

Apollo 13

Questions on homework

Page 12: Root cause analysis

Investigating causes of failures & mishaps

When performing an investigation, it is necessary to look at more than just the immediately visible cause, which is often the proximate cause.

There are also underlying organizational causes that are more difficult to see. They may contribute significantly to the undesired outcome and, if not corrected, will continue to create similar types of problems. These are root causes.

Requirements for mishap reporting and investigation: all mishap investigations must identify the proximate cause(s), root cause(s), and contributing factor(s).

Page 13: Root cause analysis

Definitions

Proximate Cause(s) (Direct Cause)

• The event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, that directly resulted in its occurrence and that, if eliminated or modified, would have prevented the undesired outcome.

• Examples of proximate causes:

Equipment: Arced; Leaked; Over-loaded; Over-heated

H: Pushed incorrect button; Fell; Dropped tool; Connected wires

Page 14: Root cause analysis

Definitions

Root Cause(s)

• One of multiple factors (events, conditions, or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented the undesired outcome. Typically, multiple root causes contribute to an undesired outcome.

Organizational factors

• Any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal.

• Examples: resource management (budget, staff, training); policy (content, implementation, verification); and management decisions.

Page 15: Root cause analysis

Definitions

Root Cause Analysis (RCA)

• A structured evaluation method that identifies the root causes for an undesired outcome and the actions adequate to prevent recurrence. Root cause analysis should continue until organizational factors have been identified, or until data are exhausted.

• RCA is a method that helps professionals determine:

• What happened.

• How it happened.

• Why it happened.

• Allows learning from past problems, failures, and accidents.

Page 16: Root cause analysis

Root Cause Analysis - Steps

1. Identify and clearly define the undesired outcome (outage).

2. Gather data.

3. Create a timeline.

4. Place events & conditions on an event and causal factor tree.

5. Use a fault tree or other method/tool to identify all potential causes.

6. Decompose system failures down to basic events or conditions. (Further describe what happened.)

7. Identify specific failure modes (Immediate Causes).

8. Continue asking “WHY” to identify root causes.

9. Check your logic and your facts. Eliminate items that are not causes or contributing factors.

10. Generate solutions that address both proximate causes and root causes.

Page 17: Root cause analysis

Root Cause Analysis - Steps

Clearly define the undesired outcome.

• Describe the undesired outcome.

• For example: “software failed to deploy,” “transaction failed,” or “XYZ project schedule significantly slipped.”

Gather data. Identify facts surrounding the undesired outcome.

• When did the undesired outcome occur?

• Where did it occur?

• What conditions were present prior to its occurrence?

• What controls or barriers could have prevented its occurrence but did not?

• What are all the potential causes?

• What actions can prevent recurrence?

• What amelioration occurred? Did it prevent further damage?
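The data-gathering questions above can be tracked as a simple checklist. The following is a minimal sketch, assuming a hypothetical `IncidentFacts` record whose field names are invented here to mirror the questions; it only reports which questions still have no recorded answer.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentFacts:
    """One field per data-gathering question (field names are illustrative)."""
    when: Optional[str] = None                # When did the outcome occur?
    where: Optional[str] = None               # Where did it occur?
    prior_conditions: Optional[str] = None    # What conditions were present beforehand?
    failed_controls: Optional[str] = None     # Which controls/barriers did not prevent it?
    potential_causes: Optional[str] = None    # What are all the potential causes?
    preventive_actions: Optional[str] = None  # What actions can prevent recurrence?
    amelioration: Optional[str] = None        # What amelioration occurred?


def open_questions(facts: IncidentFacts) -> list:
    """Return the names of the questions that are still unanswered."""
    return [f.name for f in fields(facts) if not getattr(facts, f.name)]


# Hypothetical, partially completed record.
facts = IncidentFacts(when="during go-live", where="primary data centre")
print(open_questions(facts))
```

A record like this makes it visible when an investigation moves on to the timeline before the basic facts are complete.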

Page 18: Root cause analysis

Root Cause Analysis - Steps

Create a timeline (sequence diagram)

• Illustrate the sequence of events in chronological order horizontally across the page.

• Depict relationships between conditions, events, and exceeded or failed barriers/controls.

Diagram: a horizontal chain of Event boxes ending in the Undesired Outcome, with Condition boxes and Exceeded/Failed Barrier or Control boxes attached to the events they relate to.
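As a minimal sketch of the timeline step, the records below are sorted chronologically and rendered left to right. The `TimelineEntry` type, its field names, and the timestamps are illustrative assumptions; the labels are borrowed from the go-live example used later in the deck.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime
    kind: str    # "event", "condition", "barrier", or "outcome"
    label: str


def build_timeline(entries):
    """Return the entries in chronological order (stable for equal timestamps)."""
    return sorted(entries, key=lambda e: e.when)


def render(entries):
    """Render the ordered timeline as a single left-to-right line."""
    return " -> ".join(f"[{e.kind}] {e.label}" for e in build_timeline(entries))


# Hypothetical sample data, deliberately entered out of order.
sample = [
    TimelineEntry(datetime(2024, 1, 1, 9, 30), "outcome", "Application failed to go live"),
    TimelineEntry(datetime(2024, 1, 1, 9, 0), "event", "Server powered up"),
    TimelineEntry(datetime(2024, 1, 1, 9, 5), "event", "Operating system started up"),
]
print(render(sample))
```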

Page 19: Root cause analysis

Root Cause Analysis - Steps

Create a timeline (sequence diagram)

• If amelioration occurred (e.g., reboot server, move application to another server), this should be included in the evaluation to ensure that it did not contribute to the undesired outcome.

Example: In the case of a server reboot, the investigation should ensure that the reboot was the result of the mishap and not a result of latent hardware defects.

Diagram: the same timeline of Events, Conditions, and Exceeded/Failed Barriers or Controls leading to the Undesired Outcome, extended with Exceeded/Failed Amelioration boxes.

Page 20: Root cause analysis

Root Cause Analysis - Steps

Example: simple timeline.

Diagram: timeline boxes: Server powered up; Operating system started up; Switch port in wrong VLAN; Application failed to go live; Technician used wrong method to correct; Lost transactions (penalties paid).

Page 21: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree. (A visual representation of the causes that led to the failure or mishap.)

• Place the undesired outcome at the top of the tree.

• Add all events, conditions, and exceeded/failed barriers that occurred immediately before the undesired outcome and might have caused it.

Diagram: tree with “Application failed to go live” and “Lost transactions (penalties paid)” at the top. Boxes below: Server powered up; Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct.

Page 22: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree.

• Brainstorm to ensure that all possible causes are included, NOT just those that you are sure are involved.

• Be sure to consider people, hardware, software, policy, procedures, and the environment.

Diagram: event and causal factor tree for “Application failed to go live” (lost transactions, penalties paid). Boxes: Server powered up; Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct; Electric power tripped; Power supply failed; Port labeled incorrectly; Switch labeled incorrectly; NIC driver wrong; Technicians not properly trained.
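An event and causal factor tree like the one described above can be held in a small data structure. This is a minimal sketch: the `CauseNode` class and its `category` field are invented for illustration, and the parent/child links in the sample are assumptions, since the extracted slide does not preserve the arrows between boxes.

```python
from dataclasses import dataclass, field


@dataclass
class CauseNode:
    label: str
    # One of: people, hardware, software, policy, procedures, environment.
    category: str = "unknown"
    children: list = field(default_factory=list)

    def add(self, label, category="unknown"):
        """Attach a possible cause beneath this node and return it."""
        child = CauseNode(label, category)
        self.children.append(child)
        return child


def walk(node, depth=0):
    """Yield (depth, node) pairs, parents before children."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Sample tree; the hierarchy below is assumed for illustration.
root = CauseNode("Application failed to go live")
vlan = root.add("Switch port in wrong VLAN", "hardware")
vlan.add("Port labeled incorrectly", "procedures")
vlan.add("Switch labeled incorrectly", "procedures")
tech = root.add("Technician used wrong method to correct", "people")
tech.add("Technicians not properly trained", "people")
root.add("NIC driver wrong", "software")
root.add("Power supply failed", "hardware")

for depth, node in walk(root):
    print("  " * depth + f"{node.label} ({node.category})")
```

Tagging each node with a category is one way to check that the brainstorm covered people, hardware, software, policy, procedures, and the environment.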

Page 23: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree continued...

• If you have solid data indicating that one of the possible causes is not applicable, it can be eliminated from the tree.

Caution: Do not be too eager to eliminate early on. If there is a possibility that it is a causal factor, leave it and eliminate it later when more information is available.

Diagram: the same tree for “Application failed to go live” (Server powered up; Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct; Electric power tripped; Power supply failed; Port labeled incorrectly; Switch labeled incorrectly; NIC driver wrong; Technicians not properly trained), with one potential cause crossed out (X) as eliminated.

Page 24: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree continued…

• You may use a fault tree to determine all potential causes and to decompose the failure down to the “basic event” (e.g., system component level).

Diagram: fault tree for “Application failed to go live” (lost transactions, penalties paid). Boxes: Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct; Electric power tripped; Power supply failed; NIC driver wrong; Technicians not properly trained; Switch labeled incorrectly; Port labeled incorrectly; Diagram wrong; Maintenance swap with no re-label; Confusing labels.

Page 25: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree continued…

• A fault tree can also be used to identify all possible types of human failures.

Diagram: under “Technician used wrong method to correct” (in the “Application failed to go live” tree), four types of human failure:

• Perception error: didn’t perceive system feedback.

• Interpretation error: didn’t understand system feedback.

• Decision-making error: correct interpretation, incorrect decision.

• Action-execution error: correct decision but incorrect action.

Further boxes: Rule-based error; Knowledge-based error; Skill-based error.

Page 26: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree continued…

• After you have identified all the possible causes, ask yourself “WHY” each may have occurred.

• Be sure to keep your questions focused on the original issue. For example: “Why was the condition present?”; “Why did the event occur?”; “Why was the parameter exceeded?”; or “Why did the condition fail?”

Diagram: the Undesired Outcome at the top, with Event #1, Event #2, a Condition, and a Failed or Exceeded Barrier or Control beneath it. Below each of these are several boxes: “WHY Event #1 occurred,” “WHY Event #2 occurred,” “WHY the condition existed or changed,” and “WHY the barrier or control failed or was exceeded.”

Page 27: Root cause analysis

Root Cause Analysis – Steps

Continue to ask “why” until you have reached:

1. Root cause(s) - including all organizational factors that exert control over the design, fabrication, development, maintenance, operation, and disposal of the system.

2. A problem that is not correctable by IT or IT contractor.

3. Insufficient data to continue.
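The three stopping conditions above can be sketched as a loop that keeps asking “why” until one of them applies. This is an illustration only: the `ask_why` function, the `Stop` enum, and the sample chain of answers are assumptions made for the example, not a prescribed procedure.

```python
from enum import Enum


class Stop(Enum):
    ROOT_CAUSE = "organizational root cause reached"
    NOT_CORRECTABLE = "not correctable by IT or the IT contractor"
    NO_DATA = "insufficient data to continue"


def ask_why(start, answers, organizational, uncorrectable=frozenset()):
    """Follow recorded 'why' answers from `start` until a stop condition holds.

    answers:        dict mapping a cause to the recorded reason it occurred
    organizational: set of causes that are organizational factors
    uncorrectable:  set of causes outside the organization's control
    """
    chain = [start]
    current = start
    while True:
        if current in organizational:
            return chain, Stop.ROOT_CAUSE
        if current in uncorrectable:
            return chain, Stop.NOT_CORRECTABLE
        nxt = answers.get(current)
        if nxt is None or nxt in chain:  # no recorded answer, or a circular one
            return chain, Stop.NO_DATA
        chain.append(nxt)
        current = nxt


# Hypothetical answers for one branch of the go-live example.
answers = {
    "Technician used wrong method to correct": "Insufficient anomaly training",
    "Insufficient anomaly training": "Training does not exist",
    "Training does not exist": "Insufficient training budget",
}
organizational = {"Insufficient training budget"}

chain, reason = ask_why("Technician used wrong method to correct", answers, organizational)
print(" <- ".join(chain))
print(reason.value)
```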

Page 28: Root cause analysis

Root Cause Analysis - Steps

The resultant tree of questions and answers should lead to a comprehensive picture of POTENTIAL causes for the undesired outcome.

Diagram: the Undesired Outcome at the top, with Event #1, Event #2, a Condition, and a Failed or Exceeded Barrier or Control beneath it, each expanded through several further levels of WHY boxes. One WHY box is crossed out (X).

Page 29: Root cause analysis

Root Cause Analysis - Steps

Check your logic with a detailed review of each potential cause.

• Verify it is a contributor or cause.

• If the action, deficiency, or decision in question were corrected, eliminated, or avoided, would the undesired outcome be prevented or avoided?

> If no, then eliminate it from the tree.

Diagram: the full tree of WHY boxes beneath the Undesired Outcome, with many boxes crossed out (X) after the review.
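The review described above amounts to a counterfactual test applied to every node of the tree. Below is a minimal sketch, assuming the tree is stored as nested dictionaries and that the reviewer’s verdicts are recorded in a hypothetical `would_prevent` mapping. Causes that have not been reviewed yet are kept, and eliminating a cause also drops everything beneath it (a simplifying assumption).

```python
def prune(node, would_prevent):
    """Return a copy of the tree without causes that fail the counterfactual test.

    would_prevent maps a cause label to True/False: "if this were corrected,
    eliminated, or avoided, would the undesired outcome have been prevented?"
    Labels missing from the mapping are treated as not yet reviewed and kept.
    """
    kept = [
        prune(child, would_prevent)
        for child in node.get("children", [])
        if would_prevent.get(child["label"], True)
    ]
    return {"label": node["label"], "children": kept}


def labels(node):
    """Flatten the tree into a list of labels, parents before children."""
    out = [node["label"]]
    for child in node.get("children", []):
        out.extend(labels(child))
    return out


# Hypothetical tree and review verdicts.
tree = {
    "label": "Application failed to go live",
    "children": [
        {"label": "Switch port in wrong VLAN",
         "children": [{"label": "Port labeled incorrectly", "children": []}]},
        {"label": "NIC driver wrong", "children": []},
    ],
}
review = {"Switch port in wrong VLAN": True, "NIC driver wrong": False}

print(labels(prune(tree, review)))
```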

Page 30: Root cause analysis

Root Cause Analysis - Steps

Create an event and causal factor tree continued…

• The remaining items on the tree are the causes (or probable causes) necessary to produce the undesired outcome.

• Proximate causes are those immediately before the undesired outcome.

• Intermediate causes are those between the proximate and root causes.

• Root causes are organizational factors or systemic problems located at the bottom of the tree.

Diagram: the pruned tree divided into three bands, from top to bottom: PROXIMATE CAUSES, INTERMEDIATE CAUSES, ROOT CAUSES.
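The three bands can be approximated by position in the tree. The sketch below assumes a nested-dictionary tree and a hypothetical `classify` helper: nodes directly under the outcome are labeled proximate, leaves deeper in the tree are labeled root, and everything in between is intermediate. This is only a positional heuristic; a leaf counts as a true root cause only if the analysis actually reached an organizational factor.

```python
def classify(node, depth=0, out=None):
    """Label each node by its position in the causal factor tree."""
    out = {} if out is None else out
    children = node.get("children", [])
    if depth == 0:
        level = "undesired outcome"
    elif depth == 1:
        level = "proximate"      # immediately before the outcome
    elif children:
        level = "intermediate"   # between proximate and root causes
    else:
        level = "root"           # bottom of the tree
    out[node["label"]] = level
    for child in children:
        classify(child, depth + 1, out)
    return out


# Hypothetical single-branch tree; the chain of causes is assumed.
tree = {
    "label": "Application failed to go live",
    "children": [{
        "label": "Switch port in wrong VLAN",
        "children": [{
            "label": "VLAN incorrectly assigned",
            "children": [{"label": "No quality inspection", "children": []}],
        }],
    }],
}

for label, level in classify(tree).items():
    print(f"{level:18} {label}")
```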

Page 31: Root cause analysis

Root Cause Analysis - Steps

Some people choose to leave contributing factors on the tree to show all factors that influenced the event.

Contributing factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.

If this is done, illustrate them differently (e.g., dotted line boxes and arrows) so that it is clear that they are not causes.

Diagram: the pruned tree beneath the Undesired Outcome, with a few additional WHY boxes marked as Contributing Factors.

Page 32: Root cause analysis

Investigating Causes of Failures & Mishaps

Root cause is much deeper. Keep asking why.

Diagram: causal factor tree for “Application failed to go live” (lost transactions, penalties paid). Boxes: Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct; No IP connection to network; VLAN assigned incorrectly; Incorrect server static address used; Engineer did not read correct label.

Page 33: Root cause analysis

Investigating Causes of Failures & Mishaps

Diagram: completed causal factor tree for “Application failed to go live” (lost transactions, penalties paid). Upper boxes: Operating system started up; Switch port in wrong VLAN; Technician used wrong method to correct (decision-making error: correct interpretation, incorrect decision); No IP connection to network; VLAN incorrectly assigned; VLAN changed in unrelated move; Incorrect server static address used; Engineer did not read correct label. Lower boxes: No quality inspection; Insufficient quality staff; Insufficient budget; Procedure incorrect; Not updated; Not under configuration management; New task; Insufficient anomaly training; Training does not exist; Insufficient training budget; Organization underestimates importance of anomaly training.

Page 34: Root cause analysis

Root Cause Analysis - Steps

Generating Recommendations:

At a minimum, corrective actions should be generated to eliminate proximate causes and eliminate or mitigate the negative effects of root causes.

When multiple causes exist, there is limited budget, or it is difficult to determine what should be corrected:

• Quantitative analysis can be used to determine the total contribution of each cause to the undesired outcome.

• Fishbone diagrams (or other methods) can be used to arrange causes in order of their importance.

• Those causes which contribute most to the undesired outcome should be eliminated, or their negative effects should be mitigated, to minimize risk.
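One simple way to put causes in order of contribution is a Pareto-style ranking. The slides do not prescribe a specific quantitative method, so this is a swapped-in illustration, and the cause names and scores below are made up. Each score stands for whatever measure the team agrees on (for example, estimated share of lost transactions).

```python
def rank_causes(scores):
    """Sort causes by contribution and report each share and the running total.

    scores: dict mapping cause label -> non-negative contribution score
    Returns a list of (label, share, cumulative_share), largest first.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one cause must have a positive score")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    rows, cumulative = [], 0.0
    for label, score in ranked:
        share = score / total
        cumulative += share
        rows.append((label, round(share, 3), round(cumulative, 3)))
    return rows


# Hypothetical contribution scores.
scores = {
    "Procedure not updated": 3,
    "Insufficient training budget": 5,
    "No quality inspection": 2,
}

for label, share, cumulative in rank_causes(scores):
    print(f"{label:30} {share:6.1%}  (cumulative {cumulative:6.1%})")
```

With a limited budget, corrective actions can then be assigned from the top of the list down.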

Page 35: Root cause analysis

Definitions of RCA & Related Terms

Cause (Causal Factor): An event or condition that results in an effect. Anything that shapes or influences the outcome.

Proximate Cause(s): The event(s) that occurred, including any condition(s) that existed immediately before the undesired outcome, that directly resulted in its occurrence and that, if eliminated or modified, would have prevented the undesired outcome. Also known as the direct cause(s).

Root Cause(s): One of multiple factors (events, conditions, or organizational factors) that contributed to or created the proximate cause and subsequent undesired outcome and, if eliminated or modified, would have prevented the undesired outcome. Typically, multiple root causes contribute to an undesired outcome.

Root Cause Analysis (RCA): A structured evaluation method that identifies the root causes for an undesired outcome and the actions adequate to prevent recurrence. Root cause analysis should continue until organizational factors have been identified, or until data are exhausted.

Event: A real-time occurrence describing one discrete action, typically an error, failure, or malfunction. Examples: pipe broke, power lost, lightning struck, person opened valve, etc.

Condition: Any as-found state, whether or not resulting from an event, that may have safety, health, quality, security, operational, or environmental implications.

Organizational Factors: Any operational or management structural entity that exerts control over the system at any stage in its life cycle, including but not limited to the system’s concept development, design, fabrication, test, maintenance, operation, and disposal. Examples: resource management (budget, staff, training); policy (content, implementation, verification); and management decisions.

Contributing Factor: An event or condition that may have contributed to the occurrence of an undesired outcome but, if eliminated or modified, would not by itself have prevented the occurrence.

Barrier: A physical device or an administrative control used to reduce risk of the undesired outcome to an acceptable level. Barriers can provide physical intervention (e.g., a guardrail) or procedural separation in time and space (e.g., lock-out-tag-out procedure).

Page 36: Root cause analysis

MIR Process / Forms

Major Incident – Severe business impact:

• service, system or infrastructure component not functioning adequately to enable business process

• total loss of service, system or infrastructure component

Major Incidents can also be considered to be those which do not entirely impede the use of the service, system or infrastructure component, such as:

• continuous slow response

• general degradation of service

• Refer: http://thinkingproblemmanagement.blogspot.com