Mean Time to Sleep: Quantifying the On-Call Experience

DESCRIPTION

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

Page 1: Mean Time to Sleep: Quantifying the On-Call Experience

@lozzd • @ryan_frantz

Mean Time to Sleep

Quantifying the on-call experience

Page 2: Mean Time to Sleep: Quantifying the On-Call Experience

Laurie Denness (@lozzd)

Ryan Frantz (@ryan_frantz)

Page 3: Mean Time to Sleep: Quantifying the On-Call Experience

Who is in an on-call rotation?

Page 4: Mean Time to Sleep: Quantifying the On-Call Experience

Who is on call right now?

Page 5: Mean Time to Sleep: Quantifying the On-Call Experience

Who feels like on-call sucks?

Page 6: Mean Time to Sleep: Quantifying the On-Call Experience
Page 7: Mean Time to Sleep: Quantifying the On-Call Experience

Welcome. How is on call?

Page 8: Mean Time to Sleep: Quantifying the On-Call Experience

Let’s help our people sleep

Page 9: Mean Time to Sleep: Quantifying the On-Call Experience

Make on-call more bearable

Page 10: Mean Time to Sleep: Quantifying the On-Call Experience

Incremental Changes

Page 18: Mean Time to Sleep: Quantifying the On-Call Experience

Email to Acknowledge

• Replying “ack” with some context makes it appear in IRC too
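
The email-to-acknowledge idea could be sketched roughly as follows. This is an illustration only, not Etsy's implementation: the `parse_ack` and `irc_message` helpers, the rule that the first non-empty line of the reply must begin with "ack", and the IRC message format are all assumptions made for the example.

```python
from typing import Optional


def parse_ack(reply_body: str) -> Optional[str]:
    """Return the context that follows "ack" in an email reply, or None.

    Assumed rule: the first non-empty line of the reply must begin with
    the word "ack" (case-insensitive); the rest of that line is context.
    """
    for line in reply_body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        words = stripped.split(maxsplit=1)
        if words[0].lower() != "ack":
            return None  # first real line is not an acknowledgement
        return words[1] if len(words) > 1 else ""
    return None  # empty reply


def irc_message(responder: str, alert: str, context: str) -> str:
    """Format the line a bot could post to IRC (format is illustrative)."""
    suffix = f": {context}" if context else ""
    return f"{responder} acknowledged {alert}{suffix}"
```

For example, a reply whose first line is "ack looking into it" would yield the context "looking into it", which the bot could then relay to the channel so the rest of the team sees who picked up the alert and why.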

Page 21: Mean Time to Sleep: Quantifying the On-Call Experience

Email Only Alerts

• Do you care if RAID becomes degraded in the middle of the night?

• Do you care if one of your web/Hadoop/X boxes dies in the middle of the night?

• Can it wait until the morning?

Page 32: Mean Time to Sleep: Quantifying the On-Call Experience

Added Context

• Previous service state

• Duration in that state

• Alert recipients

• Notes

• Link to runbook

Page 34: Mean Time to Sleep: Quantifying the On-Call Experience

Alert Storms

• Reduce noise when 200 things go wrong by aggregating

• Trigger an alert when the percentage of a pool that is failing goes over a threshold
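
A pool-level check of this kind might look like the sketch below. The `pool_alert` function, the host names, and the threshold values are hypothetical; the slides do not say how the real aggregate check was written.

```python
from typing import Mapping


def pool_alert(host_states: Mapping[str, bool], threshold_pct: float) -> bool:
    """Return True when more than threshold_pct percent of the pool is unhealthy.

    host_states maps host name -> True if healthy, False otherwise.
    """
    if not host_states:
        return False  # nothing to evaluate; a real check might report UNKNOWN
    unhealthy = sum(1 for healthy in host_states.values() if not healthy)
    failed_pct = 100.0 * unhealthy / len(host_states)
    return failed_pct > threshold_pct


# A hypothetical 200-host web pool with 10 hosts down.
pool = {f"web{i:03d}": i >= 10 for i in range(200)}
print(pool_alert(pool, threshold_pct=10.0))  # 5% down, under a 10% threshold
print(pool_alert(pool, threshold_pct=2.0))   # 5% down, over a 2% threshold
```

The point of the design is that the on-call engineer receives one decision about the pool rather than one page per failed host.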

Page 36: Mean Time to Sleep: Quantifying the On-Call Experience

Low-friction downtime

• IRC commands to schedule downtime for hosts or sets of hosts

Page 38: Mean Time to Sleep: Quantifying the On-Call Experience

Downtime Reminders

• Help prevent false notifications

Page 42: Mean Time to Sleep: Quantifying the On-Call Experience

Event Handlers

• Teach Nagios to augment the team

• Restarting services (nscd)

• Re-running jobs (transient errors)

• Duplicate crons (Chef)
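
As a rough sketch of the first case, an event handler can decide whether to restart a service from the state information Nagios passes to it. The `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros are standard Nagios macros; everything else here (the function names, the restart policy, and the command paths) is an assumption for illustration and not the handler Etsy runs.

```python
import subprocess
from typing import List, Optional


def restart_command(state: str, state_type: str, attempt: int,
                    service: str = "nscd") -> Optional[List[str]]:
    """Decide whether a restart is warranted.

    state      -> value of $SERVICESTATE$ (OK, WARNING, CRITICAL, UNKNOWN)
    state_type -> value of $SERVICESTATETYPE$ (SOFT or HARD)
    attempt    -> value of $SERVICEATTEMPT$

    Assumed policy: restart on the third SOFT CRITICAL check, or on any
    HARD CRITICAL state; otherwise do nothing.
    """
    if state != "CRITICAL":
        return None
    if state_type == "HARD" or (state_type == "SOFT" and attempt == 3):
        # Command paths are assumptions; adjust for the local init system.
        return ["/usr/bin/sudo", "/sbin/service", service, "restart"]
    return None


def handle(state: str, state_type: str, attempt: int) -> int:
    """Run the restart if one is warranted; return the exit status."""
    cmd = restart_command(state, state_type, attempt)
    if cmd is None:
        return 0
    return subprocess.call(cmd)
```

Keeping the decision (`restart_command`) separate from the side effect (`handle`) makes the policy easy to test without touching a real service.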

Page 45: Mean Time to Sleep: Quantifying the On-Call Experience

Incremental Improvements?

• Maybe

• More ideas; hoped they’d stick

• We didn’t know because we didn’t measure

Page 47: Mean Time to Sleep: Quantifying the On-Call Experience

Measure Everything

• “You can’t manage what you can’t measure.” - Deming (not really)

• But we weren’t measuring anything

Page 48: Mean Time to Sleep: Quantifying the On-Call Experience
Page 53: Mean Time to Sleep: Quantifying the On-Call Experience

What should we measure?

• Volume of alerts (total, by severity)

• Alert categorization (actionable vs not)

• Alert times: Off-hours?

• Noisy hosts/services
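
The four measurements listed above could be computed from a list of alert records along these lines. This is a sketch under assumptions: the `Alert` record, its field names, and the definition of off-hours (weekends, or outside 09:00 to 18:00 on weekdays) are invented for the example and are not Opsweekly's data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class Alert:
    when: datetime      # local time the notification was sent
    host: str
    service: str
    severity: str       # e.g. "WARNING" or "CRITICAL"
    actionable: bool    # tagged by the on-call engineer after the fact


def is_off_hours(when: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed definition: weekends, or outside start_hour..end_hour on weekdays."""
    return when.weekday() >= 5 or not (start_hour <= when.hour < end_hour)


def summarize(alerts: Iterable[Alert]) -> Dict[str, object]:
    """Compute volume, categorization, timing, and noisiest host/service pairs."""
    alerts = list(alerts)
    return {
        "total": len(alerts),
        "by_severity": Counter(a.severity for a in alerts),
        "actionable": sum(a.actionable for a in alerts),
        "off_hours": sum(is_off_hours(a.when) for a in alerts),
        "noisiest": Counter((a.host, a.service) for a in alerts).most_common(3),
    }
```

Running `summarize` over each week of a rotation gives numbers that can be compared from one on-call shift to the next.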

Page 54: Mean Time to Sleep: Quantifying the On-Call Experience

Opsweekly

Page 55: Mean Time to Sleep: Quantifying the On-Call Experience
Page 56: Mean Time to Sleep: Quantifying the On-Call Experience
Page 57: Mean Time to Sleep: Quantifying the On-Call Experience
Page 58: Mean Time to Sleep: Quantifying the On-Call Experience
Page 59: Mean Time to Sleep: Quantifying the On-Call Experience
Page 60: Mean Time to Sleep: Quantifying the On-Call Experience
Page 61: Mean Time to Sleep: Quantifying the On-Call Experience
Page 62: Mean Time to Sleep: Quantifying the On-Call Experience
Page 63: Mean Time to Sleep: Quantifying the On-Call Experience
Page 64: Mean Time to Sleep: Quantifying the On-Call Experience

We have data.

Page 68: Mean Time to Sleep: Quantifying the On-Call Experience

Aggregate alerts

1. Look at reports

2. Wow, look at all those alerts for the same thing

3. Aggregate alerts

4. Profit

Page 71: Mean Time to Sleep: Quantifying the On-Call Experience

Parent relationships

• Prevent alerts due to upstream issues (downed switch)

• Standard Nagios feature

• Computers can do this for us!

Page 75: Mean Time to Sleep: Quantifying the On-Call Experience

Parent relationships

• signalvnoise.com

• LLDP on host shows switch info

• Put switch info into Chef using Ohai

• Create Nagios host configs based on data
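
The last step above, turning per-host switch data into Nagios host definitions, might be sketched like this. The `use`, `host_name`, `address`, and `parents` directives are standard Nagios host object directives; the node data, the `generic-host` template name, and the helper functions are hypothetical stand-ins for whatever the Chef data actually looks like.

```python
from typing import Mapping


def host_definition(host_name: str, address: str, parent_switch: str,
                    template: str = "generic-host") -> str:
    """Render one Nagios host object whose parent is its upstream switch."""
    return "\n".join([
        "define host {",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
        f"    parents    {parent_switch}",
        "}",
    ])


def render_hosts(nodes: Mapping[str, Mapping[str, str]]) -> str:
    """Render definitions for every node, sorted by name for stable output."""
    return "\n\n".join(
        host_definition(name, attrs["address"], attrs["switch"])
        for name, attrs in sorted(nodes.items())
    )


# Hypothetical node data, shaped like something a Chef search could return.
nodes = {
    "web02": {"address": "10.0.0.12", "switch": "sw-rack12"},
    "web01": {"address": "10.0.0.11", "switch": "sw-rack12"},
}
print(render_hosts(nodes))
```

With `parents` set, Nagios can treat hosts behind a downed switch as unreachable rather than down, which is what suppresses the flood of per-host alerts.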

Page 77: Mean Time to Sleep: Quantifying the On-Call Experience

Service Dependencies

• Hundreds of Graphite-sourced checks

• Created new template that sets a servicegroup that depends on the Graphite service.
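
One way to express such a dependency is a Nagios `servicedependency` object, rendered here from Python. This is a guess at the shape of the configuration, not the template from the talk: the servicegroup, host, and service names are invented, and the directive names and failure criteria should be checked against the object definition documentation for the Nagios version in use.

```python
def graphite_dependency(dependent_servicegroup: str, graphite_host: str,
                        graphite_service: str) -> str:
    """Render a servicedependency: checks in the servicegroup depend on Graphite.

    notification_failure_criteria "w,u,c" asks Nagios to suppress notifications
    for the dependent services while the Graphite service is in a WARNING,
    UNKNOWN, or CRITICAL state.
    """
    return "\n".join([
        "define servicedependency {",
        f"    host_name                      {graphite_host}",
        f"    service_description            {graphite_service}",
        f"    dependent_servicegroup_name    {dependent_servicegroup}",
        "    notification_failure_criteria  w,u,c",
        "}",
    ])


# All three names below are hypothetical.
print(graphite_dependency("graphite-checks", "graphite01", "Graphite"))
```

The effect is that when Graphite itself is unhealthy, the on-call engineer is notified about Graphite once instead of about every check that reads from it.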

Page 79: Mean Time to Sleep: Quantifying the On-Call Experience

Keep on analyzing

• It’s okay to just identify and delete alerts that don’t mean anything!

• Or move them to email only

Page 80: Mean Time to Sleep: Quantifying the On-Call Experience
Page 81: Mean Time to Sleep: Quantifying the On-Call Experience

More Quantification!

Page 84: Mean Time to Sleep: Quantifying the On-Call Experience

Reviewing the Year

• Use reports

• Use search

• Identify noisiest alerts

Page 85: Mean Time to Sleep: Quantifying the On-Call Experience

Reviewing the Year

[Yearly report screenshots]

Page 87: Mean Time to Sleep: Quantifying the On-Call Experience

Nagios Hack Day/Week

• Great time to look at this data and make improvements

• If Disk Space is the worst, can we rethink that?

Page 90: Mean Time to Sleep: Quantifying the On-Call Experience

Outsource Your Alerts

• Etsy’s Search Team has its own on-call rotation

• A whole subset of alerts that don’t go to Ops

• More teams are starting this, but the Search Team is at 100%

Page 91: Mean Time to Sleep: Quantifying the On-Call Experience

Sleep Tracking

Page 92: Mean Time to Sleep: Quantifying the On-Call Experience

Page 93: Mean Time to Sleep: Quantifying the On-Call Experience

“Track your life!” - @ph

Page 94: Mean Time to Sleep: Quantifying the On-Call Experience

Page 95: Mean Time to Sleep: Quantifying the On-Call Experience
Page 96: Mean Time to Sleep: Quantifying the On-Call Experience
Page 97: Mean Time to Sleep: Quantifying the On-Call Experience

Page 98: Mean Time to Sleep: Quantifying the On-Call Experience
Page 99: Mean Time to Sleep: Quantifying the On-Call Experience
Page 100: Mean Time to Sleep: Quantifying the On-Call Experience
Page 101: Mean Time to Sleep: Quantifying the On-Call Experience
Page 102: Mean Time to Sleep: Quantifying the On-Call Experience
Page 103: Mean Time to Sleep: Quantifying the On-Call Experience
Page 104: Mean Time to Sleep: Quantifying the On-Call Experience
Page 105: Mean Time to Sleep: Quantifying the On-Call Experience

Page 106: Mean Time to Sleep: Quantifying the On-Call Experience

Page 111: Mean Time to Sleep: Quantifying the On-Call Experience

Did it work?

• Yes.

• Signal to noise ratio is much better

Page 115: Mean Time to Sleep: Quantifying the On-Call Experience

Did it work?

• Yes.

• Okay, so it’s a little more complicated than that

• Adding alerts all the time means new “annoying” things

• Keep monitoring

Page 116: Mean Time to Sleep: Quantifying the On-Call Experience

What’s next?

Page 120: Mean Time to Sleep: Quantifying the On-Call Experience

The Effect of Sleep

• We focus on people’s sleep

• But not the effect on the person when they come to work the next day

• How do we measure the impact of sleep loss/deprivation?

• Subjective: Pittsburgh Sleepiness Scale

• Objective: Psychomotor vigilance task (PVT) to measure alertness
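
For the objective measure, a PVT session produces a list of reaction times that can be reduced to a few numbers. The sketch below is illustrative only: the `pvt_summary` function is hypothetical, and the 500 ms lapse cutoff is a commonly used convention that is assumed here, since the slides do not specify how results would be scored.

```python
from statistics import mean
from typing import Dict, Sequence

LAPSE_MS = 500  # assumed convention: a response of 500 ms or slower is a lapse


def pvt_summary(reaction_times_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize one PVT session: trial count, mean reaction time, lapses."""
    if not reaction_times_ms:
        raise ValueError("need at least one reaction time")
    return {
        "trials": len(reaction_times_ms),
        "mean_rt_ms": mean(reaction_times_ms),
        "lapses": sum(1 for rt in reaction_times_ms if rt >= LAPSE_MS),
    }
```

Comparing a session taken the morning after a quiet night with one taken after several overnight pages would give a simple before-and-after view of alertness.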

Page 122: Mean Time to Sleep: Quantifying the On-Call Experience

Beyond Opsweekly

• Employee wellness program

• Security have started using past sleep data to check for weird logins to systems

Page 123: Mean Time to Sleep: Quantifying the On-Call Experience

More context: nagios-herald

Page 125: Mean Time to Sleep: Quantifying the On-Call Experience

More reports

• We have a bunch of data; we can build better reports and drill down to analyze alerting trends

• Can we attribute particular actions to reduced noise volume?

• Aggregate alerts

• Non-downtimed alerts

Page 126: Mean Time to Sleep: Quantifying the On-Call Experience

Thanks

Page 127: Mean Time to Sleep: Quantifying the On-Call Experience

Etsy Ops Team

Page 128: Mean Time to Sleep: Quantifying the On-Call Experience

SewMona

Page 129: Mean Time to Sleep: Quantifying the On-Call Experience

Open Source/Links

• http://ryanfrantz.com/mtts

• https://github.com/etsy/opsweekly

• https://github.com/etsy/nagios-herald

• https://github.com/jonlives/jawboneup_to_graphite

• http://codeascraft.com

Page 130: Mean Time to Sleep: Quantifying the On-Call Experience

Questions?
