Advanced PostMortem Fu and Human Error 101 (Velocity 2011)

DESCRIPTION

Not sure that these slides will make too much sense without the video, but here they are.

1. Advanced PostMortem Fu & Human Error 101. John Allspaw, VP, Tech Ops. Velocity 2011

2. We Want YOU: etsy.com/careers
3. Science, Time Travel, Mythbusting, Reading, Homework
4. Here To Challenge You
5. Resilience Engineering: Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
6. Complex, Dynamic
7. Fundamental Surprises
8. E.T.T.O.: Efficiency, Thoroughness
9-10. BLUNT: Organizations, Policies, Procedures, Regulations; Resources & Constraints. SHARP: Operators; Slips, Adjustments, Compensations, Mistakes, Lapses, Recoveries, Violations, Improvisations
11. FUTURE + PAST
12. Why Do Them?
13. Why? Understand the failure
14. Why? Understand the system
15-16. Where System = Networks, Servers, Applications, Processes, People
17. People
18-22. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
23. Microphones are ON?
24. Event Awareness: Code Deploys
25-26. TIMELINE: Traces of Data. IRC logging = Rich Data
27. IRC Logs Fed Into Solr
28. Status Blog, Twitter Feed
29. Flight Data Recorder
30. Annotation Traces
31. Investigation Basics
32. Start?
33. TTD: How?
34. TTR
35. Stable (all clear)
36. Impact Time = TTR - Start
37. SEVERITY
38. Severity 1: A. Total loss of service; B. Severe degradation, effectively unusable; C. Loss of a critical feature
39. Severity 2: A. Major degradation/feature loss for SUBSET of members; B. Minor degradation/feature loss for ALL members
40. Severity 3: Noticeable non-critical feature loss or degradation
41. Severity 4: No visible impact, loss of redundancy or capacity headroom
42. Severity 5: No-impact but unexpected failure
43. 5/11/2011 - Payments/Checkout system issue: Start 4:10pm; TTD 4:15pm; TTR 4:30pm; Stable 4:35pm; Total Impact 20 min; Severity 1
44. Basic metrics; Timeline with details; Remediations/Observations
45. Time: normal operation, Incident, PostMortem
46. How? Why? Prevention?
47. How
48-71. Crisis Patterns: Problem Starts, Stress, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear (over Time)
72-73. Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear (over Time)
74. Problem Starts, Detection, Evaluation, Response, Stable, PostMortem, Confirmation, All Clear (over Time); Internal and External Update Communications
75. Crisis Patterns: Problem Starts, Detection, Evaluation, Response, Stable, Confirmation, All Clear (over Time)
76-80. Crisis Patterns: Forced beyond learned roles; Actions whose consequences are both important and difficult to see; Cognitively and perceptively noisy; Coordinative load increases exponentially
81. Thematic Vagabonding: butterfly minds. NOT STUCK ENOUGH
82. Goal Fixation (encystment): TOO STUCK
83. Heroism: Non-communicating lone wolf-isms
84. Distraction: Irrelevant noise in comm channels
85-87. IMPROVISATION: REQUIREMENT for troubleshooting complex systems
88. Why?
89. Root Cause Analysis: "With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none." Friedrich Nietzsche, Twilight of the Idols, or How to Philosophize with a Hammer
90. Hindsight Bias: Knowledge of the outcome influences the analysis of the process. Makes steps towards failure appear foreseeable and obvious
91-92. After The Fact
93. Reality: Before and During
94. Hindsight Bias: "Should have known better"; "All the signs were there, you just needed to pay attention"
95. Hindsight Bias: BEFORE accident: The future seems implausible. AFTER accident: "Obviously clear: how could they not see what mistake they were about to make"
96. Hindsight Bias: "...people's need to be right is stronger than their ability to be objective." N. Crawford, American Psychological Association
97. Outcome Bias: Judging a past decision based on its outcome.
98. ID, Fishbone/Ishikawa, FMEA, Five Whys, Fault Tree, CED, CRT
99-105. OUTAGE: why? because this; why? because this; why? because this; but: WHY?
106. Some Action caused Some Things, which caused other things to happen: OUTAGE
107. Sequence-Of-Events
108-110. NOT HELPFUL: Satisfyingly simple, easy to explain and document; Solves for a specific case; Ignorant of surrounding circumstances; Too focused on components; Validates Hindsight and Outcome bias
111. Epidemiological (adapted from Reason, 1990)
112-115. Holes = Active/Latent Failures, Bad Things Waiting to Happen. Cheese = Safety Barriers, Layers of Defense (adapted from Reason, 1990)
116-117. Code, Servers, Schedule, Training (adapted from Reason, 1990)
118-130. Violation of known coding standards (latent condition); Capacity Miscalculation (latent condition); Unmonitored Disk Space (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): FAILURE!
131-134. Capacity Miscalculation (latent condition); Unit Test In Transition; Bug Introduced (active failure); External API call (active failure): NO FAILURE!
135-137. Better, but still NOT HELPFUL: Better than dominoes, but still linear layers of defense; Helps uncover multiple contributors and latent failures (at sharp and blunt ends); Doesn't explain lineups or orientation of holes; Only identifies defects/gaps, nothing more; Still encourages judgements of decisions
138. Multiple Contributors: each necessary but only jointly sufficient
139. Resultant Versus Emergent
140-151. Systemic: Database, Router, Memcache, Webserver, Feature Roadmap, Last Round of Funding, Dashboard Design, DBA's car, No Eng Training, Email is down, Entire Ops Team Is At Velocity, Techcrunch Article, Hiring Difficulties, S3 is slow
152-158. Systemic
159. Functional Resonance: In isolation, components act within bounds. Interconnected, they produce emerging behaviors.
160. Causes Are Constructed, Not Found. WYLFIWYF. Pre-conceived notions on causes and behaviors
161. Contributors, not causes
162. There is no root cause.
163. LEARNING
164. Quantifying Response: Time to detect? Time for escalation, internal notification? Time to notify the public? Time to graceful degradation? (feature off) Time to stable/resolve? Time to all clear?
165. Qualifying Response: High signal:noise in comm channels? Troubleshooting fatigue? Troubleshooting handoff? All tools on-hand? Metrics visibility? Collaborative and skillful communication? Improvised tooling or solutions?
166. All Together Now: Start/TTD/TTR/Stable/etc.; Severity; DATA (graphs, IRC, etc.); Description (timeline, etc.); Observations (motivations, latent conditions, etc.); Actions (remediation tickets, followup)
167-168. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
169. Human Error: "...knowledge and error flow from the same mental sources, only success can tell one from the other." Ernest Mach, 1905
170. Human Error: Nobody comes to work to do a bad job.
171. Human Error: Useless as a label and ending point.
172. Human Error: Human error isn't a cause, it's an effect.
173. Why did it make sense to the person at the time?
174-175. Error Categories: Slips, Lapses, Mismatches, Violations
176. First Stories: Human error seen as root cause. Counterfactuals: saying what they should have done. Prevention: be more careful!
177. Second Stories: Human error seen as systemic vulnerabilities, deeper inside the organization. Digging into why it made sense for them to do what they did, at the time they did it. Prevention...
178-180. Why did it make sense to the person at the time?
181. Substitution Test: Could peers have made the same error under the same circumstances?
182. WHERE to learn from?
183. Two Propositions
184. 100 deploys, 6 deploy-related issues
185. 100 > 6
186. Proposition #1: Ways in which things go right are special cases of the ways in which things go wrong.
187. Proposition #1: Successes = failures gone wrong. Study the failures, generalize from that. Data sources: 6 out of 100
188. Proposition #2: Ways in which things go wrong are special cases of the ways in which things go right.
189. Proposition #2: Failures = successes gone wrong. Study the successes, generalize from that. Data sources: 94 out of 100
190. 94/100? OR 6/100?
191. What and WHY Do Things Go RIGHT?
192. Not just: why did we fail? But also: why did we succeed?
193. Near Misses: "Hey everybody - Don't be like me. I tried to X, but because it was no good. It almost exploded everyone. So, don't do: (details about X). Love, Joe"
194. Taking the New View: Recognize that human error is an attribution.
195. Taking the New View: Pursue Second Stories.
196. Taking the New View: Escape Hindsight Bias.
197. Taking the New View: Understand work as performed at the sharp end.
198. Taking the New View: Examine how changes (at all layers) will produce new vulnerabilities.
199. Taking the New View: Use technology to support and enhance human expertise.
200. Taking the New View: Tame complexity through new forms of feedback.
201. Taking the New View: Realize that your systems are not inherently safe.
202. Taking the New View: Human error is an inevitable by-product of strained complex systems.
203. Taking the New View: Human error isn't at the root of your safety problems.
204. Taking the New View: Human error isn't random.
205. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
206. Just Culture
207. Just Culture: Balancing accountability with learning.
208. Intentional Malice
209. Negligence Found, Severity of the Outage
210-211. Name, Blame, Shame
212. Name, Blame, Shame: WHY? #!@%$
213. Must set an example!
214. Has to be some fear that not doing one's job correctly could lead to punishment.
215. Must set an example! Has to be some fear that not doing one's job correctly could lead to punishment. Punishing Deterrents is a Losing Proposition
216. Holding People Accountable != Blaming People
217. Accountability and Learning; Punishment For Errors
218. No Bad Apples, Only Bad Theories of Error
219. Name, Blame, Shame: WHY? #!@%$
220. Signs Of Old View: Gross Misconduct, Carelessness, Negligence, Egregious Behavior, Willful Violations
221. Discretionary Spaces
222-226. Acceptable / Unacceptable (who draws this subjective line?)
227. Increase Accountability By Supporting Learning
228. Empower People: Let them own their own stories. Don't make people pay penalties. Allow them to educate the organization.
229. Reduce Uncertainty: Make it clear who defines acceptable behavior.
230. Organizational Roots: Accountability = Responsibility + Requisite Authority
231. Knowing What To Expect (Anticipation); Knowing What To Look For (Monitoring); Knowing What To Do (Response); Knowing What Has Happened (Learning)
232. (Thanks, Fellas) Dr. Erik Hollnagel, Dr. David Woods, Dr. Sidney Dekker, Dr. Richard Cook
233. Homework!
234. We Want YOU: etsy.com/careers
235. Photo Credits:
    http://www.flickr.com/photos/51035644987@N01/2678090600/
    http://www.flickr.com/photos/67196253@N00/2941655917/
    http://www.flickr.com/photos/stirwise/417629641
    http://www.flickr.com/photos/38383999@N06/3888057995/
    http://www.flickr.com/photos/94443490@N00/361543080/
    http://www.flickr.com/photos/7729940@N06/4333396494/
    http://www.flickr.com/photos/cpstorm/167418602
    http://www.flickr.com/photos/63474264@N00/4366221069/
    http://www.flickr.com/photos/proimos/4199675334
    http://www.flickr.com/photos/30475691@N07/2862060992/
    http://www.flickr.com/photos/14663487@N00/797755046/
    http://www.flickr.com/photos/25080113@N06/5361445631/
236. THE END
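
The basic incident metrics on slides 32 to 43 can be sketched in a few lines of Python. This is only an illustration, not tooling shown in the talk: the `Incident` class, its field names, and the `at` helper are hypothetical, and the numeric severity levels simply mirror slides 38 to 42. The timestamps are the ones given for the 5/11/2011 Payments/Checkout issue on slide 43.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels as summarized on slides 38-42."""
    SEV1 = 1  # total loss, severe degradation, or loss of a critical feature
    SEV2 = 2  # major loss for a subset of members, or minor loss for all
    SEV3 = 3  # noticeable non-critical feature loss or degradation
    SEV4 = 4  # no visible impact; loss of redundancy or capacity headroom
    SEV5 = 5  # no impact, but an unexpected failure


@dataclass
class Incident:
    # Hypothetical field names; the slides only name the four clock times.
    title: str
    start: datetime
    detected: datetime   # the "TTD" clock time on slide 43
    recovered: datetime  # the "TTR" clock time on slide 43
    stable: datetime
    severity: Severity

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.start

    @property
    def impact_time(self) -> timedelta:
        # Slide 36: Impact Time = TTR - Start
        return self.recovered - self.start


def at(clock: str) -> datetime:
    """Parse a clock time such as '4:10pm' on the slide's date, 5/11/2011."""
    return datetime.strptime(f"5/11/2011 {clock}", "%m/%d/%Y %I:%M%p")


checkout = Incident(
    title="Payments/Checkout system issue",
    start=at("4:10pm"),
    detected=at("4:15pm"),
    recovered=at("4:30pm"),
    stable=at("4:35pm"),
    severity=Severity.SEV1,
)

print(checkout.time_to_detect)  # 0:05:00
print(checkout.impact_time)     # 0:20:00, matching "Total Impact 20 min"
```

Keeping the raw timestamps and deriving the durations from them means the other questions under "Quantifying Response" on slide 164 (time to notify, time to all clear) could be added as further properties without changing the stored record.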