The evolving role of context in Incident Management

Sandstorm or Significant? The evolving role of situational context in incident management

Matthew Boeckman, Developer Advocate

Victorops.com/blog

@matthewboeckman

Background

● 18 years on-call Ops
● 15 years with software teams
● Startup junkie
● DevOps enthusiast

What is VictorOps?

VictorOps ingests all of the alerts from your current monitoring tools and becomes the logical layer between those alerts and the people who receive them.

victorops.com/IMA

5 Phases of Incident Management

● Detection: monitoring, metrics, thresholds
● Response: alerting, on-call, escalation
● Remediation: fixes, tickets, deployments
● Analysis: postmortem, how or why, understand
● Readiness: improvement, game days, learning

Standard Incident Workflow

Detection, Response, Remediation, Analysis, Readiness

Incident Management Assessment Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner

Competent

Proficient

Expert

Incident Management Maturity Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner x

Competent x x

Proficient x x

Expert

Self Assessment

Poll: How would you rate your overall team maturity?

A. Novice
B. Beginner
C. Competent
D. Proficient
E. Expert

The Focus Question

How can we help teams mature their incident management practice?

(Stated plainly: make on-call suck less)

Situational Context

Incident Management Key Metrics

● Mean Time to Repair (MTTR)
● Availability (SLA)
● Ticket volumes
● Escalations
● Customer satisfaction

Incident Management Key Metrics

Time Spent Managing Incidents - Low Maturity

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

Time Spent Managing Incidents - Medium Maturity

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

Time Spent Managing Incidents - High Maturity

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

A New Core Metric

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

● Identify trends
● Capacity plan
● Improve infrastructure
● Gamedays
● Cross train
● Update runbooks
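
The two metrics on this slide can be sketched as simple averages over incident timestamps. This is only an illustration: the `Incident` fields are hypothetical, and it assumes MTTR runs from detection to resolution while TTL runs from resolution to the point where postmortem follow-up is complete. The slides name TTL but do not define its boundaries, so treat that definition as an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; map them to whatever your tooling records.
    detected_at: datetime  # alert fired
    resolved_at: datetime  # service restored
    learned_at: datetime   # postmortem follow-up finished (assumed end of TTL)


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttr_minutes(incidents):
    """Mean time from detection to resolution."""
    return mean_minutes([i.resolved_at - i.detected_at for i in incidents])


def ttl_minutes(incidents):
    """Mean time from resolution to completed follow-up (assumed TTL definition)."""
    return mean_minutes([i.learned_at - i.resolved_at for i in incidents])


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 3, 0)
    sample = [
        Incident(start, start + timedelta(minutes=45), start + timedelta(days=3)),
        Incident(start, start + timedelta(minutes=20), start + timedelta(days=1)),
    ]
    print("MTTR (min):", mttr_minutes(sample))
    print("TTL (min):", ttl_minutes(sample))
```

Tracking both numbers side by side makes it visible when repair time improves while learning time does not.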

Beep Beep Beep

Standard Incident Workflow

Standard Diagnostic Procedure

1. Fire up the VPN

2. Navigate dashboards, find relevant section

3. Review ticket or incident history for host

4. Review Runbooks for associated host

Common Bottlenecks to Establishing Context

● Multiple sources of record
● Duplicate runbooks or documentation
● Metric overload
● New responders unfamiliar with systems

Where Does it Hurt?

Poll: Which is the most painful problem you experience in establishing context?

A. Multiple sources of record
B. Duplicate documentation
C. Metric overload
D. Everything is equally on fire
E. Everything is fantastic

Beep Beep Beep

A Tale of Two Graphs

Massive spike above expected norm

Response: Fire up the laptop and put a pot of coffee on

A Tale of Two Graphs

Small spike for a consistently loaded box.

Response: ACK alert, go back to sleep
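
The contrast between the two graphs can be approximated by scoring a reading against the host's own history rather than against a fixed threshold. The sketch below is one possible way to do that, not a method taken from the talk: the function names, the labels, and the cutoff of 3.0 standard deviations are all illustrative assumptions.

```python
from statistics import mean, pstdev


def deviation_from_baseline(history, current):
    """Return how many standard deviations `current` sits from the host's own history."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is treated as unbounded deviation.
        return 0.0 if current == baseline else float("inf")
    return (current - baseline) / spread


def classify(history, current, threshold=3.0):
    """Label a reading; the default threshold is an arbitrary illustrative choice."""
    score = deviation_from_baseline(history, current)
    return "significant" if abs(score) >= threshold else "routine"


if __name__ == "__main__":
    quiet_host = [10, 12, 11, 9, 10, 11]   # normally idle, hypothetical CPU %
    busy_host = [78, 82, 80, 85, 79, 81]   # consistently loaded, hypothetical CPU %
    print(classify(quiet_host, 85))  # large jump above the expected norm
    print(classify(busy_host, 84))   # small bump on an already loaded box
```

The same absolute reading can therefore page a responder on one host and be safely acknowledged on another.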

This Time, with Context!

Enhanced Contextual Workflow

Alert Enhancements

Poll: My team is doing some enhancement of alerts today.

A. True
B. False

CI/CD Exacerbates the Contextual Challenge

Many incidents can be tracked to deploys

Developer Velocity = Constant Change

Silos impair communication
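
Since many incidents trace back to deploys, one way to supply context is to attach recent deploy history and a runbook link to an alert before it reaches the responder. The sketch below uses plain dictionaries; every key name, the one-hour lookback window, and the runbook mapping are hypothetical and do not reflect any particular product's API.

```python
from datetime import datetime, timedelta


def recent_deploys(deploys, host, alert_time, window=timedelta(hours=1)):
    """Return deploys to `host` that finished within `window` before the alert."""
    return [
        d for d in deploys
        if d["host"] == host
        and timedelta(0) <= alert_time - d["finished_at"] <= window
    ]


def enrich_alert(alert, deploys, runbooks):
    """Return a copy of the alert annotated with deploy history and a runbook link."""
    enriched = dict(alert)  # leave the original alert untouched
    enriched["recent_deploys"] = recent_deploys(
        deploys, alert["host"], alert["fired_at"]
    )
    enriched["runbook"] = runbooks.get(alert["host"])  # None when no runbook is known
    return enriched


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 2, 30)
    deploys = [
        {"host": "web-1", "version": "build-101", "finished_at": fired - timedelta(minutes=12)},
        {"host": "web-1", "version": "build-100", "finished_at": fired - timedelta(hours=6)},
    ]
    alert = {"host": "web-1", "fired_at": fired, "message": "HTTP 5xx rate high"}
    print(enrich_alert(alert, deploys, {"web-1": "https://example.invalid/runbooks/web-1"}))
```

A responder who sees "deployed 12 minutes ago" in the alert itself can form a hypothesis without first opening a VPN and a dashboard.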

A Tale of Two Incidents

A Tale of Two Incidents

Introducing: The Scientific Method

Make Observations (the measurement)

Ask a question (why would a webserver quit working?)

Form a hypothesis (because we just deployed?)

The Sandstorm

No. Do not.

Measure Everything: the Anti-pattern

Measurements cost time and money

Busy dashboards lead to subconscious filtering

Measurements create a natural impulse to alert

Enhance

Stop

An Embarrassment of Dashboards

Rule of Thumb

Measure much

Alert on some

Contextualize all

Iteration is Key

Dialing in context takes time

Conduct blameless postmortems

Experiment with more and less context

Be objective in your assessment of what works

Leverage Situational Context

Providing incident responders with context can meaningfully impact MTTR, paying dividends in time to move your practice forward.

The Beginning

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

The Goal

Detection, Response, Remediation, Analysis, Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

● Identify trends
● Capacity plan
● Improve infrastructure
● Gamedays
● Cross train
● Update runbooks

Take the IMA!

http://victorops.com/ima

Questions?

Thank you!

Matthew Boeckman

@matthewboeckman

Slides on devops.com & slideshare.com