THREE DAYS IN SEPTEMBER “Houston, We Have a Problem.” by Steve Feldman, @PerfForensics

The Agenda
I. This is a True Story... It Really Did Happen
II. Houston, We Have a Problem.
III. The Really Good Vendors Care
IV. Getting to Zero
V. The Damage Was Already Done
VI. Where We Are Today

Houston, We Have a Problem...

Our Outage Affected our most Important Asset

Our Outage Was Caused By Human Error

NEVER REBOOT A UNIX MACHINE!

The Monitoring “Cameras” Should Always Be On

Keep Everyone Informed

Who Wants Their Users to Report the Problem First?

Not All of the Data is Believable

Crises Are the Best Time to Determine the Strength of the Team

Keep Your Boss Informed

Keep Your Users Informed

Keep Your Users Updated

Continue to Keep Your Users Updated

Getting to Zero

Log Consolidation

Continue to Keep Your Users Updated

It is Not Just About Restoring Service

It is OK to Admit Mistakes

Let Your Boss Take Credit

Your Boss Did Not Build a Fragile System

Do a Post-Mortem

The Problem Started Long Before

Where We Are Today

Practice Really Matters

Practice Failure

Look at Your Manuals

Practice Routines and Roles

Practice Every Day

NEVER REBOOT A UNIX MACHINE!

Thanks for Listening

@PerfForensics