Accident investigation course



Accident and Incident Investigation

Objectives of this Section
- To define the reasons for investigating accidents and incidents.
- To outline the process for effectively investigating accidents and incidents.
- To facilitate an effective investigation.

Accident Investigation
- An important part of any safety management system.
- Highlights the reasons why accidents occur and how to prevent them.
The primary purpose of accident investigations is to improve health and safety performance by:
- Exploring the reasons for the event and identifying both the immediate and underlying causes;
- Identifying remedies to improve the health and safety management system by improving risk control, preventing a recurrence and reducing financial losses.

What to Investigate?
- All accidents, whether major or minor, are caused.
- Serious accidents have the same root causes as minor accidents, as do incidents with a potential for serious loss. It is these root causes that bring about the accident; the severity is often a matter of chance.
- Accident studies have shown that there is a consistently greater number of less serious accidents than serious accidents, and in the same way a greater number of incidents than accidents.

Many accident ratio studies have been undertaken. The one shown here is based on studies carried out by the Health & Safety Executive:
- 1 major injury or illness
- 7 minor injuries or illnesses
- 189 non-injury accidents/illnesses

Accident Studies
- In all cases the non-injury incidents had the potential to become events with more serious consequences.
- Such ratios clearly demonstrate that safety effort should be aimed at all accidents, including unsafe practices at the bottom of the pyramid, with a resulting improvement in the upper tiers.
- Peterson (1978), in defining the principles of safety management, says that an unsafe act, an unsafe condition and an accident are all symptoms of something wrong within the management system.

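As a rough illustration of the 1 : 7 : 189 ratio quoted above, the short Python sketch below works out what share of all recorded events falls in each tier. The tier labels and the function name are made up for this example; only the three counts come from the slide.

```python
# Accident ratio quoted on the slide: 1 major : 7 minor : 189 non-injury.
# The dictionary keys are illustrative labels, not official HSE categories.
RATIO = {
    "major injury or illness": 1,
    "minor injury or illness": 7,
    "non-injury accident/illness": 189,
}


def share_of_events(ratio):
    """Return each tier's share of all recorded events as a fraction."""
    total = sum(ratio.values())
    return {tier: count / total for tier, count in ratio.items()}


shares = share_of_events(RATIO)
for tier, share in shares.items():
    print(f"{tier}: {share:.1%}")
```

On these figures roughly 96% of events involve no injury at all, which is the arithmetic behind aiming safety effort at the bottom of the pyramid.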
Accident Studies (contd)
- All events represent a degree of failure in control and are potential learning experiences.
- It therefore follows that all accidents should be investigated to some extent.
- This extent should be determined by the loss potential, rather than just the immediate effect.

Stages in an Accident/Incident Investigation
The stages in an accident/incident investigation are:
1. Deal with immediate risks.
2. Select the level of investigation.
3. Investigate the event.
4. Record and analyse the results.
5. Review the process.

Dealing with Immediate Risks
When accidents and incidents occur, immediate action may be necessary to:
- Make the situation safe and prevent further injury.
- Help, treat and if necessary rescue injured persons.
An effective response can only be made if it has been planned for in advance.

Day 2 start

Selecting the level of investigation
The greatest effort should be put into:
- Events involving severe injuries, ill health or loss.
- Events which could have caused much greater harm or damage.
These types of accidents and incidents demand more careful investigation and management time. This can usually be achieved by:
- Looking more closely at the underlying causes of significant events.
- Assigning the responsibility for the investigation of more significant events to more senior managers.

Investigating the Event
The purpose of investigations is to establish:
- The way things were and how they came to be.
- What happened: the sequence of events that led to the outcome.
- Why things happened as they did, analysing both the immediate and underlying causes.
- What needs to be done to avoid a repetition and how this can be achieved.

A few sources should give the investigator all that is needed to know.
Observation: information from physical sources, including:
- Premises and place of work
- Access & egress
- Plant & substances in use
- Location & relationship of physical particles
- Any post-event checks, sampling or reconstruction
Documents: information from:
- Written instructions
- Procedures, risk assessments, policies
- Records of earlier inspections, tests, examinations and surveys
Checking reliability and accuracy; identifying conflicts and resolving differences; identifying gaps in evidence.
Interviews: information from:
- Those involved and their line management
- Witnesses
- Those observed or involved prior to the event, e.g. inspection & maintenance staff

Interviews
- Interviewing the person(s) involved and witnesses to the accident is of prime importance, ideally in familiar surroundings so as not to make the person uncomfortable.
- The interview style is important, with emphasis on prevention rather than blame.
- The person(s) should give an account of what happened in their own terms rather than the investigator's.
- Interviews should be separate to stop people from influencing each other.
- Questions should not be intimidating, otherwise the investigator will be seen as aggressive and as reflecting a blame culture.

Observation
The accident site should be inspected as soon as possible after the accident. Particular attention should be given to:
- Positions of people.
- Personal protective equipment (PPE).
- Tools and equipment, plant or substances in use.
- Orderliness/tidiness.

Documents
Documentation to be looked at includes:
- Written instructions, procedures and risk assessments which should have been in operation and followed. The validity of these documents may need to be checked by interview.
  The main points to look for are:
  - Are they adequate/satisfactory?
  - Were they followed on this occasion?
  - Were people trained/competent to follow them?
- Records of inspections, tests, examinations and surveys undertaken before the event. These provide information on how and why the circumstances leading to the event arose.

Determining Causes
- Collect all information and facts which surround the accident.
- Immediate causes are obvious and easy to find. They are brought about by unsafe acts and conditions and are the ACTIVE FAILURES.
- Unsafe acts show poor safety attitudes and indicate a lack of proper training.
- These unsafe acts and conditions are brought about by the so-called root causes. These are the LATENT FAILURES and are brought about by failures in the organisation and in management's safety system.

Determine what changes are needed
The investigation should determine what control measures were absent, inadequate or not implemented, and so generate remedial actions to correct this.

Generally, remedial actions should follow the hierarchy of risk control:
- Eliminate risks by substituting the dangerous with the inherently less dangerous.
- Combat risks at source by engineering controls, giving collective protective measures priority.
- Minimise risk by designing suitable systems of working.
- Use PPE as a last resort.

Day 3 start

Recording & Analysing the Results
- Recorded in a similar and systematic manner.
- Provides a historical record of the accident.
- Analysis of the causes and recommended preventative and protective measures should be listed.
- Completed as soon after the accident as possible.
- Information on the accident and remedial actions should be passed to all supervisors.
- Appropriate preventative measures may also have to be implemented by such supervisors.
- Investigation reports and accident statistics should be analysed from time to time to identify common causes, features and trends that may not be apparent from looking at events in isolation.

Reviewing the Process
Reviewing the accident/incident investigation process should consider:
- The results of investigations and analysis.
- The operation of the investigation system (in terms of quality and effectiveness).
Line managers should follow through and action the findings of investigations and analysis. Follow-up systems should be established where necessary to keep progress under control.

The investigation system should be examined from time to time to check that it consistently delivers information in accordance with the stated objectives and standards. This usually requires:
- Checking samples of investigation forms to verify the standard of investigation and the judgements made about causation and prioritisation of remedial actions;
- Checking the numbers of incidents, near misses, injury and ill-health events;
- Checking that all events are being reported.

What is your definition of an Accident?

What is an Accident?
- an unplanned event
- an unplanned incident involving injury or fatality
- a series of events culminating in an unplanned and unforeseen event

How do Accidents occur?
- Accidents (with or without injuries) occur when a series of unrelated events coincide at a certain time and place.
- This can be from a few events to a series of a dozen or more. (Because the coincidence of the series of events is a matter of luck, actual accidents only happen infrequently.)

Unsafe Acts
- An unsafe act occurs in approximately 85% to 95% of all analyzed accidents with injuries
- An unsafe act is usually the last of a series of events before the accident occurs (it could occur at any step of the event)
- By stopping or eliminating the unsafe act, we can stop the accident from occurring

What is an Accident Investigation?
A systematic approach to the identification of causal factors and implementation of corrective actions without placing blame on or finding personal fault. The information collected during an investigation is essential for determining trends and taking appropriate steps to prevent future accidents.

Which Accidents should be Recorded or Reported?
ALL accidents (including illnesses) shall be recorded and reported through the established procedures and guidance.

Why Investigate Accidents?
- Determine the cause
- Develop and implement corrective actions
- Document the events
- Meet legal requirements
Primary Focus: PREVENT REOCCURRENCE!!!

Accident vs. Near-Miss
- Accident: Any undesired, unplanned event arising out of a given work-related task which results in physical injury/illness or damage to property.
- Near-Miss: Events which did not result in injury/illness or damage but had the potential to do so.

Accident Ratio Study
- 1 serious or disabling injury
- 10 minor injuries
- 30 property damage accidents
- 600 accidents with no visible injury or damage
- 6000 unsafe acts or conditions

Accident Causes
- Unsafe Act: an act by the injured person or another person (or both) which caused the accident, and/or
- Unsafe Condition: some environmental or hazardous situation which caused the accident independent of the employee

Accident Causation Model
Results of the accident
- physical harm
- property damage
Incident occurrence
- contact with
- type
Immediate causes
- practices
- conditions
Basic causes
- personal factors
- job factors
- supervisory performance
- management policy and decisions

Results of the Accident
Physical Harm
- catastrophic (multiple deaths)
- single death
- disabling
- serious
- minor
Property Damage
- catastrophic
- major
- serious
- minor

Incident Occurrence
Type
- struck by
- struck against
- slip, trip
- fell from
- caught on
- fell on same level
- caught in
- overexertion
Contact with
- electricity
- noise
- hazmat
- radiation
- equipment
- vibration
- heat/cold
- animals/insects

Immediate Causes
Practices
- operating without authority
- use equipment improperly
- not using PPE when required
- correct lifting procedures not established
- drinking or drug use
- horseplay
- equipment not properly secured

Immediate Causes (contd)
Conditions
- ineffective guards
- unserviceable tools and equipment
- inadequate warning systems
- bad housekeeping practices
- poor work space illumination
- unhealthy work environment

Basic Causes
Personal Factors
- lack of knowledge or skill
- improper motivation
- physical or mental condition
- literacy or ability
Job Factors
- physical environment
- sub-standard equipment
- abnormal usage
- wear and tear
- inadequate standards
- design and maintenance

Basic Causes (contd)
Supervisory Performance
- inadequate instructions
- failure of SOPs
- rules not enforced
- hazards not corrected
- devices not provided
Management Policy and Decisions
- set measurable standards
- measure work in progress
- evaluate work vs. standards
- correct performance
No animals were hurt as a result of this accident

Severity of Incident
Major
- Employee fatality
- Hospitalization of 3 or more employees
- Permanent employee disability
- Five or more lost workdays
- Conditions that could pose an imminent threat of serious injury/illness to other employees
- Property losses in excess of $1 million
Minor
- All other (less serious) incidents and unsafe conditions reported by employees

Who Investigates?
Major Accidents
- NOAA GO TEAM Investigation Team
- LO Representative
- Other agencies such as NTSB, USCG, OSHA
Minor Accidents
- First-Line Supervisor
- Site Director or Manager
- Site Safety Representative
- NOAA SECO (if needed)

Investigator's Qualifications
- Technical knowledge
- Objectivity
- Analytical approach
- Familiarity with the job, process or operation
- Tact in communicating
- Intellectual honesty
- Inquisitiveness and curiosity

When to Investigate?
- Immediately after the incident
  - Witness memories fade
  - Equipment and clues are moved
- Finish the investigation quickly

What to Investigate?
All accidents and near-misses
- Conduct the investigation upon first notification
- Keeping the scene intact and recording witness statements early is key to a successful investigation

Accident Investigation Kit
May include:
- Digital camera
- Report forms, clipboard, pens
- Barricade tape
- Flashlight
- Tape measure
- Tape recorder
- Personal protective equipment (as appropriate)

The Accident Occurs
- Employee or co-worker immediately reports the accident to a supervisor
- Supervisor secures/assesses the scene to prevent additional injuries to other employees before assisting the injured employee
- Supervisor treats the injury or seeks medical treatment for the injured
- The accident scene is left intact
- Site safety rep is contacted to assist the supervisor in the investigation of the accident

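The major/minor criteria listed under Severity of Incident lend themselves to a simple check, for instance to decide which investigators to call. The sketch below is only an illustration: the `Incident` record and its field names are hypothetical, and only the thresholds are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical field names; only the thresholds below come from the slide.
    fatalities: int = 0
    employees_hospitalized: int = 0
    permanent_disability: bool = False
    lost_workdays: int = 0
    imminent_threat_to_others: bool = False
    property_loss_usd: float = 0.0


def classify(incident: Incident) -> str:
    """Return "major" if any one major criterion is met, otherwise "minor"."""
    is_major = (
        incident.fatalities > 0
        or incident.employees_hospitalized >= 3      # 3 or more employees
        or incident.permanent_disability
        or incident.lost_workdays >= 5               # five or more lost workdays
        or incident.imminent_threat_to_others
        or incident.property_loss_usd > 1_000_000    # in excess of $1 million
    )
    return "major" if is_major else "minor"


print(classify(Incident(lost_workdays=5)))            # meets the lost-workday criterion
print(classify(Incident(employees_hospitalized=2)))   # below every threshold
```

Any single criterion is enough to make an incident major, which is why the conditions are joined with `or`.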
Beginning the Investigation
- Gather investigation members and kit
- Report to the scene
- Look at the big picture
- Record initial observations
- Take pictures

What's Involved?
- Who was injured?
- Medication, drugs, or alcohol?
- Was the employee ill or fatigued?
- Environmental conditions?

Witnesses
- Who witnessed the accident?
- Was a supervisor or Team Lead nearby?
- Where were other employees?
- Why didn't anyone witness the accident (working alone, remote areas)?

Interviewing Tips
- Discuss what happened leading up to and after the accident
- Encourage witnesses to describe the accident in their own words
- Don't be defensive or judgmental
- Use open-ended questions
- Do not interrupt the witness

What was Involved?
- Machine, tool, or equipment
- Chemicals
- Environmental conditions
- Field season prep operations

Time of Accident
- Date and time?
- Normal shift or working hours?
- Employee coming off a vacation?

Accident Location
- Work area
- On, under, in, near
- Off-site address
- Doing normal job duties
- Performing nonroutine or routine tasks (i.e., properly trained)

Employee's Activity
- Motion conducted at time of accident
- Repetitive motion?
- Type of material being handled

Accident Narrative
- Describe the details so the reader can clearly picture the accident
- Specific body parts affected
- Specific motions of the injured employee just before, during, and after the accident

Causal Factors
- Try not to accept a single-cause theory
- Identify underlying (root) causes
- Primary cause
- Secondary causes
- Contributing causes
- Effects

Corrective Actions Taken
- Include immediate interim controls implemented at the time of the accident
- Recommended corrective actions
  - Employee training
  - Preventive maintenance activities
  - Better operating procedures
  - Hazard recognition (ORM)
  - Management awareness of risks involved

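The items walked through above (who was injured, witnesses, what was involved, time, location, activity, narrative, causal factors, corrective actions) could be captured as one record so that gaps are easy to spot. This is a minimal sketch with invented names: `AccidentReport`, its fields and `missing_sections` are assumptions for illustration, not the layout of any real report form.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class AccidentReport:
    # Hypothetical record; one field per topic covered on the slides above.
    injured_person: str
    occurred_at: datetime
    location: str
    activity: str
    narrative: str
    witnesses: list[str] = field(default_factory=list)
    items_involved: list[str] = field(default_factory=list)
    causal_factors: list[str] = field(default_factory=list)
    corrective_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in field order."""
        return [name for name, value in vars(self).items() if not value]


report = AccidentReport(
    injured_person="J. Doe",  # made-up example data
    occurred_at=datetime(2024, 5, 1, 14, 30),
    location="Workshop, near lathe",
    activity="Lifting a crate",
    narrative="Employee strained lower back while lifting a crate from the floor.",
)
print(report.missing_sections())
```

A check like `missing_sections` reflects the point made under Causal Factors and Corrective Actions Taken: a report that stops at the narrative is not yet complete.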
Immediate Notification
Supervisor shall complete the NOAA Web Based Accident/Illness Report Form and submit it within 24 hours of incident occurrence (8 hours for major incidents).

Accident Analysis Summary
- Investigate the accident immediately
- Determine who was involved and who witnessed it
- Ascertain what items or equipment were involved
- Record a detailed description
- Determine causal factors
- Implement corrective actions

1. What is an Accident Investigation?
   a. A systematic approach to the identification of causal factors and implementation of corrective actions.
   b. Finding personal fault and placing blame.
   c. The appropriate steps to prevent future actions.
   d. The essential step to determine trends and taking action against the person or persons at fault.

2. Which Accidents should be Recorded or Reported?
   a. Only on-the-job accidents.
   b. ALL accidents (including illnesses) shall be recorded and reported.
   c. Only on-the-job accidents or illnesses that occur on the job and are reported within 8 hours.
   d. All accidents shall be recorded and reported.

3. Why Investigate Accidents?
   a. To develop and implement corrective actions.
   b. To document the events.
   c. The Primary Focus is to PREVENT REOCCURRENCE!!!
   d. To determine the cause.

4. Accident vs. Near-Miss?
   a. Any unplanned event arising out of work that resulted in injury vs. any event which did not result in injury but had the potential to do so.
   b. Any unsafe work habit vs. any hazardous working conditions.
   c. Any event which warns us of a problem vs. any circumstances that result in injury or property damage.

5. Which of the following are the basic areas that are looked at in an Accident Investigation?
   a. Policies.
   b. Equipment.
   c. Training.
   d. All of the above.

Accident Investigation
Accident analysis is carried out in order to determine the cause or causes of an accident or series of accidents so as to prevent further incidents of a similar kind.
It is also known as accident investigation.
It may be performed by a range of experts, including forensic scientists, forensic engineers or health and safety advisers. Accident investigators, particularly those in the aircraft industry, are colloquially known as "tin-kickers".

Sequence
Accident analysis is performed in four steps:
1. Fact gathering: After an accident has happened, a forensic process starts to gather all possibly relevant facts that may contribute to understanding the accident.
2. Fact analysis: After the forensic process has been completed, or has at least delivered some results, the facts are put together to give a "big picture". The history of the accident is reconstructed and checked for consistency and plausibility.
3. Conclusion drawing: If the accident history is sufficiently informative, conclusions can be drawn about causation and contributing factors.
4. Counter-measures: In some cases the development of counter-measures is desired, or recommendations have to be issued to prevent further accidents of the same kind.

Methods
There exist numerous forms of accident analysis methods. These can be divided into three categories:

Causal Analysis
Causal Analysis uses the principle of causality to determine the course of events. Though people casually speak of a "chain of events", results from Causal Analysis usually have the form of directed acyclic graphs, with the nodes being events and the edges the cause-effect relations. Methods of Causal Analysis differ in their respective notion of causation.

Expert Analysis
Expert Analysis relies on the knowledge and experience of field experts. This form of analysis usually lacks a rigorous (formal/semiformal) methodological approach, which usually affects the falsifiability and objectivity of analyses. This is of importance when conclusions are heavily disputed among experts.

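To make the directed-acyclic-graph form of Causal Analysis concrete, here is a minimal Python sketch using the standard library's `graphlib.TopologicalSorter`. The events and cause-effect links are invented for illustration and do not come from any real investigation.

```python
from graphlib import TopologicalSorter

# Each effect maps to the set of its direct causes (hypothetical events).
causes = {
    "worker injured": {"guard removed", "machine started unexpectedly"},
    "guard removed": {"maintenance not signed off"},
    "machine started unexpectedly": {"no lock-out procedure"},
    "maintenance not signed off": set(),
    "no lock-out procedure": set(),
}


def root_causes(graph):
    """Events with no recorded cause of their own."""
    return sorted(event for event, parents in graph.items() if not parents)


def causal_order(graph):
    """List events so that every cause appears before its effects.

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(graph).static_order())


print(root_causes(causes))
print(causal_order(causes))
```

Note that the outcome has two independent branches of causes, which is exactly what a single "chain of events" cannot represent.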
Organizational Analysis
Organizational Analysis relies on systemic theories of organization. Most theories imply that if a system's behaviour stays within the bounds of the ideal organization then no accidents can occur.
Organizational Analysis can be falsified, and results from analyses can be checked for objectivity. Choosing an organizational theory for accident analysis comes from the assumption that the system to be analysed conforms to that theory.

Using Digital Photographs to Extract Evidence
Once all available data has been collected by accident scene investigators and law enforcement officers, camera matching, photogrammetry or rectification can be used to determine the exact location of physical evidence shown in the accident scene photos.

Camera matching
Camera matching uses accident scene photos that show various points of evidence. The technique uses CAD software to create a three-dimensional model of the accident site and roadway surface.
All survey data and photos are then imported into a three-dimensional software package like 3D Studio Max. A virtual camera can then be positioned relative to the 3D roadway surface. Physical evidence is then mapped from the photos onto the 3D roadway to create a three-dimensional accident scene drawing.

Photogrammetry
Photogrammetry is used to determine the three-dimensional geometry of an object at the accident scene from the original two-dimensional photos.
The photographs can be used to extract evidence that may be lost after the accident is cleared. Photographs from several viewpoints are imported into software like PhotoModeler.
The forensic engineer can then choose points common to each photo. The software will calculate the location of each point in a three-dimensional coordinate system.

Rectification
Photographic rectification is also used to analyze evidence that may not have been measured at the accident scene. Two-dimensional rectification transforms a single photograph into a top-down view. Software like PC-Rect can be used to rectify a digital photograph.

Failure mode and effects analysis
Failure Mode and Effects Analysis (FMEA) was one of the first systematic techniques for failure analysis. It was developed by reliability engineers in the 1950s to study problems that might arise from malfunctions of military systems.
An FMEA is often the first step of a system reliability study. It involves reviewing as many components, assemblies, and subsystems as possible to identify failure modes, and their causes and effects.
For each component, the failure modes and their resulting effects on the rest of the system are recorded in a specific FMEA worksheet. There are numerous variations of such worksheets. An FMEA is mainly a qualitative analysis.
A few different types of FMEA exist, such as Functional, Design, and Process FMEA.
Sometimes the FMEA is called FMECA to indicate that a criticality analysis is also performed.
An FMEA is an inductive reasoning (forward logic) single point of failure analysis and is a core task in reliability engineering, safety engineering and quality engineering. Quality engineering is especially concerned with the "Process" (manufacturing and assembly) type of FMEA.
A successful FMEA activity helps to identify potential failure modes based on experience with similar products and processes, or based on common physics of failure logic.

FMEA is widely used in development and manufacturing industries in various phases of the product life cycle. Effects analysis refers to studying the consequences of those failures on different system levels.
Functional analyses are needed as an input to determine correct failure modes, at all system levels, both for functional FMEA and for piece-part (hardware) FMEA.
An FMEA is used to structure mitigation for risk reduction based on reducing the severity of the failure (mode) effect, on lowering the probability of failure, or on both.
The FMEA is in principle a fully inductive (forward logic) analysis; however, the failure probability can only be estimated or reduced by understanding the failure mechanism.
Ideally this probability shall be lowered to "impossible to occur" by eliminating the (root) causes. It is therefore important to include in the FMEA an appropriate depth of information on the causes of failure (deductive analysis).
The FME(C)A is a design tool used to systematically analyze postulated component failures and identify the resultant effects on system operations. The analysis is sometimes characterized as consisting of two sub-analyses, the first being the failure modes and effects analysis (FMEA), and the second the criticality analysis (CA).
Successful development of an FMEA requires that the analyst include all significant failure modes for each contributing element or part in the system. FMEAs can be performed at the system, subsystem, assembly, subassembly or part level.
The FMECA should be a living document during development of a hardware design. It should be scheduled and completed concurrently with the design.
If completed in a timely manner, the FMECA can help guide design decisions. The usefulness of the FMECA as a design tool and in the decision-making process depends on the effectiveness and timeliness with which design problems are identified.
Timeliness is probably the most important consideration. In the extreme case, the FMECA would be of little value to the design decision process if the analysis is performed after the hardware is built.
While the FMECA identifies all part failure modes, its primary benefit is the early identification of all critical and catastrophic subsystem or system failure modes so they can be eliminated or minimized through design modification at the earliest point in the development effort.
Therefore, the FMECA should be performed at the system level as soon as preliminary design information is available and extended to the lower levels as the detail design progresses.
Remark: For more complete scenario modelling, other types of reliability analysis may be considered, for example fault tree analysis (FTA), a deductive (backward logic) failure analysis that may handle multiple failures within the item and/or external to the item, including maintenance and logistics. It starts at a higher functional/system level. An FTA may use the basic failure mode FMEA records or an effect summary as one of its inputs (the basic events). Interface hazard analysis, human error analysis and others may be added for completion in scenario modelling.

Functional analysis
The analysis may be performed at the functional level until the design has matured sufficiently to identify specific hardware that will perform the functions; then the analysis should be extended to the hardware level.
When performing the hardware level FMECA, interfacing hardware is considered to be operating within specification. In addition, each part failure postulated is considered to be the only failure in the system (i.e., it is a single failure analysis).
In addition to the FMEAs done on systems to evaluate the impact lower level failures have on system operation, several other FMEAs are done. Special attention is paid to interfaces between systems and in fact at all functional interfaces. The purpose of these FMEAs is to assure that irreversible physical and/or functional damage is not propagated across the interface as a result of failures in one of the interfacing units.
These analyses are done to the piece-part level for the circuits that directly interface with the other units. The FMEA can be accomplished without a CA, but a CA requires that the FMEA has previously identified system level critical failures. When both steps are done, the total process is called an FMECA.

Ground rules
The ground rules of each FMEA include a set of project-selected procedures; the assumptions on which the analysis is based; the hardware that has been included in and excluded from the analysis; and the rationale for the exclusions. The ground rules also describe the indenture level of the analysis, the basic hardware status, and the criteria for system and mission success.
Every effort should be made to define all ground rules before the FMEA begins; however, the ground rules may be expanded and clarified as the analysis proceeds. A typical set of ground rules (assumptions) follows:
- Only one failure mode exists at a time.
- All inputs (including software commands) to the item being analyzed are present and at nominal values.
- All consumables are present in sufficient quantities.
- Nominal power is available.

Benefits
Major benefits derived from a properly implemented FMECA effort are as follows:
- It provides a documented method for selecting a design with a high probability of successful operation and safety.
- A documented, uniform method of assessing potential failure mechanisms, failure modes and their impact on system operation, resulting in a list of failure modes ranked according to the seriousness of their system impact and likelihood of occurrence.
- Early identification of single failure points (SFPs) and system interface problems, which may be critical to mission success and/or safety. They also provide a method of verifying that switching between redundant elements is not jeopardized by postulated single failures.
- An effective method for evaluating the effect of proposed changes to the design and/or operational procedures on mission success and safety.
- A basis for in-flight troubleshooting procedures and for locating performance monitoring and fault-detection devices.
- Criteria for early planning of tests.

Basic terms
The following covers some basic FMEA terminology.
Failure: The loss of a function under stated conditions.
Failure mode: The specific manner or way by which a failure occurs in terms of failure of the function of the item (a part or (sub)system) under investigation; it may generally describe the way the failure occurs. It shall at least clearly describe an (end) failure state of the item (or function, in the case of a Functional FMEA) under consideration. It is the result of the failure mechanism (the cause of the failure mode). For example, a fully fractured axle, a deformed axle, and a fully open or fully closed electrical contact are each a separate failure mode.

Basic terms / Failure cause and/or mechanism: Defects in requirements, design, process, quality control, handling or part application that are the underlying cause, or sequence of causes, initiating a process (mechanism) that leads to a failure mode over a certain time. A failure mode may have more than one cause.
123. Basic terms / Failure cause and/or mechanism: For example, "fatigue or corrosion of a structural beam" or "fretting corrosion in an electrical contact" is a failure mechanism and in itself (likely) not a failure mode. The related failure mode (end state) is a "full fracture of a structural beam" or "an open electrical contact". The initial cause might have been "improper application of corrosion protection layer (paint)" and/or "(abnormal) vibration input from another (possibly failed) system".
124. Basic terms / Failure effect: Immediate consequences of a failure on operation, function or functionality, or status of some item.
125. Indenture levels (bill of material or functional breakdown): An identifier for system level and thereby item complexity. Complexity increases as levels are closer to one.
126. Local effect: The failure effect as it applies to the item under analysis.
127. Next higher level effect: The failure effect as it applies at the next higher indenture level.
128. End effect: The failure effect at the highest indenture level or total system.
129. Detection: The means of detection of the failure mode by maintainer, operator or built-in detection system, including estimated dormancy period (if applicable).
130. Risk Priority Number (RPN): Cost (of the event) * Probability (of the event occurring) * Detection (probability that the event would not be detected before the user was aware of it).
131. Severity: The consequences of a failure mode.
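The Risk Priority Number defined on slide 130 is a plain product of three rankings, so it can be illustrated in a few lines. The following is a minimal sketch only: the 1 to 10 ranking scale, the class name FailureMode and the ranking values are assumptions made for illustration and do not come from the slides; only the failure mode names are borrowed from the examples above.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    cost: int         # ranking of the cost of the event (assumed scale 1-10)
    probability: int  # ranking of the likelihood of occurrence (assumed scale 1-10)
    detection: int    # ranking of the chance the event goes undetected (assumed scale 1-10)

    def rpn(self) -> int:
        """Risk Priority Number: cost * probability * detection."""
        return self.cost * self.probability * self.detection


# Hypothetical worksheet rows; the rankings are invented for illustration.
modes = [
    FailureMode("electrical contact stuck open", cost=5, probability=4, detection=3),
    FailureMode("axle fully fractured", cost=9, probability=2, detection=4),
]

# Sort so that the failure mode with the highest RPN is listed first.
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.name}: RPN = {mode.rpn()}")
```

With these assumed rankings the fractured axle scores 72 and the open contact scores 60, so the axle would be addressed first.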
Severity considers the worst potential consequence of a failure, determined by the degree of injury, property damage, system damage and/or time lost to repair the failure.
132. Remarks / mitigation / actions: Additional information, including the proposed mitigation or actions used to lower a risk or to justify a risk level or scenario.
133. Example FMEA Worksheet
134. Probability (P): In this step it is necessary to look at the cause of a failure mode and its likelihood of occurrence. This can be done by analysis, by calculations / FEM, or by looking at similar items or processes and the failure modes that have been documented for them in the past. A failure cause is looked upon as a design weakness. All the potential causes of a failure mode should be identified and documented.
135. Probability (P): Causes should be described in technical terms. Examples of causes are: human errors in handling, manufacturing-induced faults, fatigue, creep, abrasive wear, erroneous algorithms, excessive voltage, or improper operating conditions or use (depending on the ground rules used). Each failure mode is given a probability ranking.
136. Probability (P)
137. Severity (S): Determine the severity of the worst-case adverse end effect (state). It is convenient to write these effects down in terms of what the user might see or experience as functional failures. Examples of these end effects are: full loss of function x, degraded performance, functions in reversed mode, too-late functioning, erratic functioning, etc.
138. Severity (S): Each end effect is given a severity number (S) from, say, I (no effect) to VI (catastrophic), based on cost and/or loss of life or quality of life. These numbers prioritize the failure modes (together with probability and detectability). Below a typical classification is given. Other classifications are possible. See also hazard analysis.
139. Severity (S)
140. Detection (D)
141.
Detection (D): The means or method by which a failure is detected and isolated by the operator and/or maintainer, and the time it may take. This is important for maintainability control (availability of the system) and is especially important for multiple-failure scenarios.
142. Detection (D): This may involve dormant failure modes (e.g. no direct system effect while a redundant system or item automatically takes over, or a failure that is only problematic during specific mission or system states) or latent failures (e.g. deterioration failure mechanisms, such as a crack growing in metal that has not yet reached a critical length).
143. Detection (D): It should be made clear how the failure mode or cause can be discovered by an operator under normal system operation, or whether it can be discovered by the maintenance crew through some diagnostic action or automatic built-in system test. A dormancy and/or latency period may be entered.
144. Detection (D)
145. Detection (D) / Dormancy or latency period: The average time that a failure mode may be undetected may be entered if known. For example: during aircraft C Block inspection, preventive or predictive maintenance, X months or X flight hours; during aircraft B Block inspection, preventive or predictive maintenance, X months or X flight hours; during Turn-Around Inspection before or after flight (e.g. 8 hours average); during in-built system functional test, X minutes; continuously monitored, X seconds.
146. Detection (D) / Indication: If the undetected failure allows the system to remain in a safe / working state, a second failure situation should be explored to determine whether or not an indication will be evident to all operators and what corrective action they may or should take.
147. Detection (D): Indications to the operator should be described as follows: Normal. An indication that is evident to an operator when the system or equipment is operating normally. Abnormal.
An indication that is evident to an operator when the system has malfunctioned or failed. Incorrect. An erroneous indication to an operator due to the malfunction or failure of an indicator (i.e., instruments, sensing devices, visual or audible warning devices, etc.).
148. Detection (D) / Perform detection coverage analysis for test processes and monitoring (from the ARP4761 standard):
149. Detection (D): This type of analysis is useful to determine how effective various test processes are at the detection of latent and dormant faults. The method used to accomplish this involves an examination of the applicable failure modes to determine whether or not their effects are detected, and to determine the percentage of failure rate applicable to the failure modes which are detected. The possibility that the detection means may itself fail latent should be accounted for in the coverage analysis as a limiting factor (i.e., coverage cannot be more reliable than the availability of the detection means).
150. Detection (D): Inclusion of the detection coverage in the FMEA can lead to each individual failure that would have been one effect category now being a separate effect category due to the detection coverage possibilities. Another way to include detection coverage is for the FTA to conservatively assume that no holes in coverage due to latent failure in the detection method affect detection of all failures assigned to the failure effect category of concern. The FMEA can be revised if necessary for those cases where this conservative assumption does not allow the top event probability requirements to be met.
151. Detection (D): After these three basic steps the risk level may be provided.
152. Risk level (P*S) and (D): Risk is the combination of end-effect probability and severity, where probability and severity include the effect of non-detectability (dormancy time).
This may influence the end-effect probability of failure or the worst-case effect severity. The exact calculation may not be easy when multiple scenarios (with multiple events) are possible and detectability / dormancy plays a crucial role (as for redundant systems). In that case Fault Tree Analysis and/or Event Trees may be needed to determine exact probability and risk levels.
153. Risk level (P*S) and (D): Preliminary risk levels can be selected based on a risk matrix like the one shown below, based on MIL-STD-882. The higher the risk level, the more justification and mitigation is needed to provide evidence and lower the risk to an acceptable level. High risk should be indicated to higher-level management, who are responsible for final decision making.
154. Risk level (P*S) and (D)
155. Risk level (P*S) and (D): After this step the FMEA has become like an FMECA.
156. Timing: The FMEA should be updated whenever: a new cycle begins (new product/process); changes are made to the operating conditions; a change is made in the design; new regulations are instituted; customer feedback indicates a problem.
157. Uses: Development of system requirements that minimize the likelihood of failures; development of designs and test systems to ensure that the failures have been eliminated or the risk is reduced to an acceptable level; development and evaluation of diagnostic systems; help with design choices (trade-off analysis).
158. Advantages: Improve the quality, reliability and safety of a product/process; improve company image and competitiveness; increase user satisfaction; reduce system development time and cost; collect information to reduce future failures and capture engineering knowledge.
159.
Advantages: Reduce the potential for warranty concerns; early identification and elimination of potential failure modes; emphasize problem prevention; minimize late changes and associated cost; act as a catalyst for teamwork and idea exchange between functions; reduce the possibility of the same kind of failure in future; reduce impact on company profit margin; improve production yield.
160. Limitations: If used as a top-down tool, FMEA may only identify major failure modes in a system. Fault tree analysis (FTA) is better suited for "top-down" analysis. When used as a "bottom-up" tool, FMEA can augment or complement FTA and identify many more causes and failure modes resulting in top-level symptoms. It is not able to discover complex failure modes involving multiple failures within a subsystem, or to report expected failure intervals of particular failure modes up to the upper-level subsystem or system.
161. Limitations: Additionally, the multiplication of the severity, occurrence and detection rankings may result in rank reversals, where a less serious failure mode receives a higher RPN than a more serious failure mode. The reason for this is that the rankings are ordinal scale numbers, and multiplication is not defined for ordinal numbers. The ordinal rankings only say that one ranking is better or worse than another, but not by how much. For instance, a ranking of "2" may not be twice as severe as a ranking of "1", and an "8" may not be twice as severe as a "4", but multiplication treats them as though they are. See Level of measurement for further discussion.
162. Types / Functional: Before design solutions are provided (or only at a high level), functions can be evaluated for potential functional failure effects. General mitigations ("design to" requirements) can be proposed to limit the consequences of functional failures or to limit the probability of occurrence in this early development. It is based on a functional breakdown of a system.
This type may also be used for software evaluation.
163. Types / Concept design (hardware): Analysis of systems or subsystems in the early design concept stages to analyse the failure mechanisms and lower-level functional failures, especially to evaluate different concept solutions in more detail. It may be used in trade-off studies.
164. Types / Detailed design (hardware): Analysis of products prior to production. These are the most detailed FMEAs (called piece-part or hardware FMEA in MIL-STD-1629) and are used to identify any possible hardware (or other) failure mode down to the lowest part level. The analysis should be based on the hardware breakdown (e.g. the BoM, or bill of material). Any failure effect severity, failure prevention (mitigation), failure detection and diagnostics may be fully analysed in this FMEA.
165. Types / Process: Analysis of manufacturing and assembly processes. Both quality and reliability may be affected by process faults. The input for this FMEA is, amongst others, a work process / task breakdown.
166.
167. HOW TO CONDUCT AN EFFECTIVE SAFETY ASSESSMENT: OFFICE SPACES
168. Why should you be conducting assessments? To spot unsafe conditions and equipment; to focus on unsafe work practices or behavior trends before they lead to injuries; to reveal the need for new safeguards; to provide a safe working environment for all workers.
169. What should I look for during an office assessment? Emergency egress; work environment; ergonomics; emergency information; fire prevention; electrical systems; employee behavior.
170. Emergency Egress: Blocked or locked doorways; locking devices that can impede emergency egress; properly marked exits; properly illuminated exits; clear aisles and pathways.
171. Work Environment: Clean, sanitary and orderly work spaces; tripping hazards such as loose tiles, carpeting or flooring. Are drawers kept open when not in use? Are items stored above shoulder level and unsecured?
172.
Ergonomics: Are workstations configured to prevent employee discomfort and injury? Are employees aware of ergonomic risk factors? Have employees received ergonomic training?
173. Emergency Information: Are emergency phone numbers posted where they can be readily found? Are employees trained in emergency procedures? Are evacuation procedures and diagrams posted?
174. Fire Prevention: Are portable fire extinguishers readily available and unobstructed? Are fire pull stations clearly marked and unobstructed? Are all fire sprinkler heads kept clear and unobstructed (at least 18 inches)? Are space heaters used and authorized?
175. Electrical Systems: Are extension cords/power strips kept uncoupled (not piggy-backed)? Are all extension cords/power strips provided by the agency? Are electrical outlets clear of combustible materials? Do electrical cords create trip hazards? Are extension cords used as permanent wiring?
176. Employee Behavior: Are employees observing established safety rules? Do employees minimize hazards by applying Operational Risk Management principles? Are employees allowed to report unsafe conditions or acts without restraint?
177. Operational Risk Management (ORM) cycle: Identify, Assess, Decide, Control, Supervise.
178. How to assess safety / SUMMARY: Promoting safety; monthly assessment program; positive findings (above and beyond minimum requirements); assessments cover emergency information, egress, environment, ergonomics, fire prevention, electrical systems and unsafe behavior.
179. Risk Assessment and Management
180. Getting the Measure of Risk: Having understood the potential accident sequences associated with a hazard (e.g.
using ETA), the next step is to determine the severity of the credible accidents identified. Remember that risk is the product of severity and probability of an accident. Two different approaches: Estimate the probability of the accident, and hence get a measure of accident risk, then decide whether the estimated risk is acceptable. Used in many domains, including rail and military aerospace. Will discuss this approach first, using rail standards as
181. Accident Severity: Accident severity categories are qualitative descriptions of the consequences of failure conditions (hazards), considering likely impact. (Source: EN 50126)
Severity Level | Consequence to Persons or Environment | Consequence to Service
Catastrophic | Fatalities and/or multiple severe injuries and/or major damage to the environment |
Critical | Single fatality and/or severe injury and/or significant damage to the environment | Loss of a major system
Marginal | Minor injury and/or significant threat to the environment | Severe system(s) damage
Insignificant | Possible minor injury | Minor system damage
182. Accident Probability: Next, estimate (predict) the accident probability. Use historical results, analysis, and engineering judgment to determine the appropriate qualitative probability category. Note that we may have to consider both how likely the hazard is to arise and how likely the hazard is to develop into an accident. (Source: EN 50126)
Category | Description
Frequent | Likely to occur frequently. The hazard will be continually experienced.
Probable | Will occur several times. The hazard can be expected to occur often.
Occasional | Likely to occur several times. The hazard can be expected to occur several times.
Remote | Likely to occur sometime in the system lifecycle. The hazard can reasonably be expected to occur.
Improbable | Unlikely to occur, but possible. It can be assumed that the hazard will exceptionally occur.
Incredible | Extremely unlikely to occur. It can be assumed that the hazard may not occur.
183.
Classifying Risk: Having assigned severity and probability to the hazard consequences, the next step is to use a Hazard Risk Matrix to classify the risk. (Source: EN 50126)
Risk levels, by frequency of occurrence of a hazardous event (rows) and severity level of hazard consequence (columns):
Frequency | Insignificant | Marginal | Critical | Catastrophic
Frequent | Undesirable | Intolerable | Intolerable | Intolerable
Probable | Tolerable | Undesirable | Intolerable | Intolerable
Occasional | Negligible | Undesirable | Undesirable | Intolerable
Remote | Negligible | Tolerable | Undesirable | Undesirable
Improbable | Negligible | Negligible | Tolerable | Tolerable
Incredible | Negligible | Negligible | Negligible | Negligible
184. Accepting Risk: Reasoning about risk: using the HRI it is now possible to say, e.g., Risk(Hazard H1) > Risk(Hazard H2). In order to say what is acceptable / unacceptable, we must provide an interpretation: risk actions to be applied against each category, e.g. (Source: EN 50126)
Category | Action
Intolerable | Shall be eliminated
Undesirable | Shall only be accepted when risk reduction is impracticable and with the agreement of the Railway Authority or the Safety Regulatory Authority, as appropriate
Tolerable | Acceptable with adequate control and with the agreement of the Railway Authority
Negligible | Acceptable with the agreement of the Railway Authority
185. Managing Risk: Risk resolution: can associate objectives or actions with risk class, e.g. technologies used, development processes, assessment criteria. Example: for undesirable risk, might decide that no single point of failure shall lead to a system accident; that the probability of fatality must be < 1 x 10^-8 per hour; failure behaviour over time (lifetime of system).
186. Determining Risk - Civil Aerospace Style 1: Start with determination of severity; very similar to rail categories. (Source: ARP 4761)
187.
Determining Risk - Civil Aerospace Style 2: When severity has been determined, can set objectives (requirements) for risk control, primarily boundaries on the acceptable probability of a failure condition (hazard). (Adapted from ARP 4761)
Severity Classification | Probability Objective (Descriptive) | Probability Objective (Quantitative, per flight hour)
Catastrophic | Extremely Improbable | < 10^-9
Hazardous | Extremely Remote | 10^-7 to 10^-9
Major | Remote | 10^-5 to 10^-7
Minor | Reasonably Probable | 10^-3 to 10^-5
Minor | Frequent | > 10^-3
188. Determining Risk - Civil Aerospace Style 3: For civil aerospace, severity-related objectives are set in standards: easy to work with; unambiguous, provided you can agree on standardised and objective measures of severity! BUT need to understand that the direct mapping from severity to probability objectives is based on an important assumption:
189. Determining Risk - Civil Aerospace Style 4: Where does acceptable risk come from? In principle, requirements reflect what risk the public is willing to accept: risk(A) = probability(A) * severity(A); the level of acceptable risk is hard to determine, and subjective. In practice, certification bodies (airworthiness authorities) act as surrogates for the public; the bottom line is the hull loss rate. The civil aviation hull loss rate target is currently 10^-7 per flying hour; for comparison, military aviation (UK) hull loss rate
190. Determining Risk - Civil Aerospace Style 5: Has further implications: an implicit assumption about the number of catastrophic failure conditions on an aircraft; also an implicit assumption about how probable a failure condition is to actually develop into an accident. Example: the probability objective (target) for a catastrophic failure condition is < 10^-9 per flight hour; the target hull loss rate is < 10^-7 per flight hour; this implies either a maximum of 100 catastrophic failure conditions on an aircraft, assuming all occurrences of catastrophic failure conditions will
191.
Determining Risk - Civil Aerospace Style 6: Note that an objective expressed as probability per flying hour has its problems. Consider: histogram shows accidents / time; 1.8% of accidents occur in load / taxi / unload
192. The ALARP Principle 1: ALARP = As Low As Reasonably Practicable. Diagram regions, from highest to lowest risk: INTOLERABLE: risk cannot be justified on any grounds. THE ALARP (As Low As Reasonably Practicable) REGION: risk is undertaken only if benefit is desired; TOLERABLE only if risk reductions are impracticable or cost grossly disproportionate to the improvement gained; TOLERABLE if cost of reduction would exceed improvement gained. BROADLY ACCEPTABLE REGION: NEGLIGIBLE RISK.
193. The ALARP Principle 2: Provides an interpretation of identified risks. Pragmatic: although you can always spend more money to improve safety, it is not always cost-effective. However, cost-effectiveness introduces ambiguity. Regions of tolerability are defined by regulatory domain and customer. The approach is often implicit in the management of safety-critical projects anyway.
194. Risk Reduction Flowchart 1: Identify and determine risk associated with identified hazards. Flowchart stage IDENTIFY HAZARD and RISK: System Design; Hazard Identification; Hazard Risk (Severity/Probability) Established.
195. Risk Reduction Flowchart 2: Flowchart stage ASSESS RISK, following Identify Hazard and Risk (System Design; Hazard Identification; Hazard Risk (Severity/Probability) Established): Risk Measured Against HRI Matrix Criteria; Risk Acceptable? (Yes / No).
196.
Risk Reduction Flowchart 3: Flowchart stage TAKE ACTION, following Identify Hazard and Risk (System Design; Hazard Identification; Hazard Risk (Severity/Probability) Established) and Assess Risk (Risk Measured Against HRI Matrix Criteria; Risk Acceptable?). If Yes: continue design; document analysis and justification. If No: apply re-design precedence criteria: 1. Redesign to eliminate hazard, or reduce likelihood. 2. Incorporate mitigation, e.g. safety devices. 3. Provide warnings. 4. Develop procedures and training (operator / crew training required).
197. Precedence in Risk Reduction 1: Redesign to eliminate risk: best where practical; change in operational role, or removal of hazardous material. Redesign to reduce hazard likelihood: select architecture or components; duplex or triplex, or higher-integrity components with lower failure rates. Incorporate mitigation to reduce impact of failures: automated protection, e.g. pressure relief valves; where incorporated, need to check periodically.
198. Precedence in Risk Reduction 2: Provide warning devices: detect the hazardous condition and warn operators, e.g. indicate that landing gear has not fully deployed. Provide procedures and training: reduce likelihood of hazard, or mitigate, e.g. to evacuate building due to fire or fumes; may involve use of personal protective equipment. Do not assume procedures are enough by themselves: consider evolution of power guillotine regulations. The measures are listed in precedence order.
199. Residual Risk - 1: Residual risks are those that cannot be designed out: risks inherent to design, where benefit is desirable. Significant residual risks must be formally accepted by the appropriate authority (typically customer / operator). Can use Decision Authority Matrix, e.g.
Decision Authority Matrix (MIL-STD-882C), by frequency of occurrence (rows) and hazard severity category (columns):
Frequency of Occurrence | I CATASTROPHIC | II CRITICAL | III MARGINAL | IV NEGLIGIBLE
A FREQUENT | HIGH | HIGH | HIGH | MEDIUM
B PROBABLE | HIGH | HIGH | MEDIUM | LOW
C OCCASIONAL | HIGH | HIGH | MEDIUM | LOW
D REMOTE | HIGH | MEDIUM | LOW | LOW
E IMPROBABLE | MEDIUM | LOW | LOW | LOW
200. Residual Risk 2: Appropriate Decision Authority (from MIL-STD-882C): HIGH: Service Acquisition Executive (e.g. no ground collision avoidance on F22 signed off by 4-star Air Force General). MEDIUM: Program Executive Officer. LOW: Program Manager. Usually a requirement to document all actions taken to resolve risk within terms of contract. Customer authority can then decide whether
201. Risk Management Summary: Risk assessment is the process of identifying the risk associated with system hazards. The approach in many sectors (military, rail) is to use a Hazard Risk Matrix to determine the risk associated with a hazard from severity and probability estimates, then decide on the acceptability of the risk. The alternative approach (civil aerospace) is based around severity and the assumption of a fixed level of acceptable risk... so can derive objectives, including probability, from severity.
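The Decision Authority Matrix from the Residual Risk slides is in effect a lookup table, and the matrix-based approach named in the summary can be sketched as one. The following is an illustrative sketch only: the cell values and authority titles are transcribed from the MIL-STD-882C slides above, while the names RISK_MATRIX, DECISION_AUTHORITY and classify are hypothetical and chosen for this example.

```python
from typing import Tuple

# Column order of the matrix: hazard severity categories I to IV.
SEVERITY_COLUMNS = ["CATASTROPHIC", "CRITICAL", "MARGINAL", "NEGLIGIBLE"]

# Rows A to E: frequency of occurrence -> risk level per severity column.
RISK_MATRIX = {
    "FREQUENT":   ["HIGH", "HIGH", "HIGH", "MEDIUM"],
    "PROBABLE":   ["HIGH", "HIGH", "MEDIUM", "LOW"],
    "OCCASIONAL": ["HIGH", "HIGH", "MEDIUM", "LOW"],
    "REMOTE":     ["HIGH", "MEDIUM", "LOW", "LOW"],
    "IMPROBABLE": ["MEDIUM", "LOW", "LOW", "LOW"],
}

# Who must formally accept a residual risk at each level.
DECISION_AUTHORITY = {
    "HIGH": "Service Acquisition Executive",
    "MEDIUM": "Program Executive Officer",
    "LOW": "Program Manager",
}


def classify(frequency: str, severity: str) -> Tuple[str, str]:
    """Return (risk level, decision authority) for one hazard."""
    row = RISK_MATRIX[frequency.upper()]
    level = row[SEVERITY_COLUMNS.index(severity.upper())]
    return level, DECISION_AUTHORITY[level]


level, authority = classify("Remote", "Critical")
print(f"Risk level {level}: acceptance by {authority}")
```

Keeping the matrix as data rather than as nested conditionals makes it straightforward to substitute a different matrix, such as the EN 50126 one, without touching the lookup logic.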