KEVINA FINN-BRAUN, SALESFORCE

J. PAUL REED, RELEASE ENGINEERING APPROACHES

DEVOPS ENTERPRISE SUMMIT, 2015

THE BLAMELESS CLOUD: BRINGING ACTIONABLE RETROSPECTIVES TO SALESFORCE

DOES15 - Finn-Braun and Reed - The Blameless Cloud: Bringing Actionable Retrospectives to Salesforce

KEVINA FINN-BRAUN

• Director of Site Reliability Service Management at Salesforce

• Business Continuity at Yahoo

• Geeks out on Group Dynamics and Behavior

• @kfinnbraun on

• Prepping for the zombie apocalypse

@kfinnbraun @jpaulreed #DOES15

J. PAUL REED

• @jpaulreed on

• Host of The Ship Show, @shipshowpodcast on

• Principal Consultant, Release Engineering Approaches

• Spend my days talking to organizations about “The DevOps™”


“SITE RELIABILITY” AT SALESFORCE

• Primary operational team supporting availability

• Acceptance and validation activities

• Develop and implement operational improvements for SFDC

• “Game days”

SERVICE RELIABILITY HURDLES AT SFDC

• Inconsistent application of process, leading to inconsistent information collection

• Incident handling/remediation crossing silo boundaries

• Confusion over service ownership, due to restructured responsibilities

• Disjointed, “heavyweight” meetings

• Postmortems centered around “The Old View” of human error


LANGUAGE OF THE “OLD VIEW”

• “5 whys”

• “Root cause” analysis

• “Why didn’t you[r team]…”

• “You[r team] should have…”

• “Best practices”


THE TIMELINE

• October 2014: First Meeting

• January 2015: “Blow up” HA Forum

• April 2015: Status Check, including assessment shared with senior leaders

• May 2015: Service ownership roles shift

• July 2015: Initial Workshop on “The New View”

• August 2015: Identified first group for coaching

• August 2015 — today: Continued focus and deep-dive on WSRR

• August 2015 — today: Weekly sessions with the initial group


[Flowchart: Root Cause Analysis Workflow]

Goal: Root cause identified five business days from incident resolution.

Nodes:

• Incident, Event, Bug

• Initial Analysis

• RC Known?

• Request RCA/Failure Analysis

• Facilitator opens investigations and schedules post mortem meeting

• RC Identified?

• Unable to ascertain root cause; update record with “KE Status”

• Identify corrective actions and implementation plans; assign actions to scrum teams

• Engage scrum teams as required.

• RCM Needed?

• Impact to Customer/Production or ability to release?

• RCM Process

• Tier 3 support communicate RCM to customer(s)

• Review @HA?

• HA Forum

• Additional work items from HA are assigned.

• Corrective Actions complete?

• Weekly meetings to follow up with scrum master on progress

• Update record and set status to “resolved”

• END

HA? Incident Guidelines:

• Severity 0, 1: YES

• Severity 2: Maybe (instance & incident length?)

• Functional Regression: Maybe

• Incorrect/Incomplete Release: YES

• Deployment Delayed or Rolled Back: Maybe
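As a rough illustration only, the HA Forum incident guidelines and the five-business-day goal from the workflow slide could be sketched as a lookup table and a date helper. This is a minimal sketch, not anything shown in the talk: the dictionary keys, the `Review` enum, and both function names are hypothetical, and the date helper assumes a Monday to Friday work week with no holiday calendar.

```python
from datetime import date, timedelta
from enum import Enum


class Review(Enum):
    """Whether an incident is reviewed at the HA Forum (per the slide)."""
    YES = "yes"
    MAYBE = "maybe"


# Values follow the slide's guidelines; the key names are hypothetical labels.
HA_GUIDELINES = {
    "severity_0": Review.YES,
    "severity_1": Review.YES,
    "severity_2": Review.MAYBE,  # slide: depends on instance & incident length
    "functional_regression": Review.MAYBE,
    "incorrect_or_incomplete_release": Review.YES,
    "deployment_delayed_or_rolled_back": Review.MAYBE,
}


def ha_review(incident_type: str) -> Review:
    """Look up the guideline for an incident type (KeyError if unknown)."""
    return HA_GUIDELINES[incident_type]


def root_cause_due(resolved_on: date, business_days: int = 5) -> date:
    """Return the date `business_days` weekdays after resolution.

    Assumption: weekends are skipped, holidays are not considered.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current
```

For an incident resolved on Monday, 19 October 2015, `root_cause_due(date(2015, 10, 19))` gives Monday, 26 October 2015. The "Maybe" entries are kept as a distinct value rather than forced to yes or no, since the slide leaves those cases to judgment.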

ROOT CAUSE ANALYSIS WORKFLOW

• Designed & implemented two years ago

• Anchored the process around the weekly “HA Forum”

• Intended to apply to all incidents…

• In practice, focused on high profile incidents


ROOT CAUSE ANALYSIS WORKFLOW IN REALITY

• Silo transition boundaries evident in the workflow

• Some had little/no contact, via the process, with other teams required to perform their job

• Sampling of incident reports uncovered consistent inconsistencies

• The “Bermuda Blob”

GETTING A FEEL FOR THE WEATHER


HEAD FIRST INTO THE STORM


LANGUAGE: MATTERS

• “HA Forum” ➡ “WSRR”

• “WAR” (What is it good for?)

• Postmortem versus Retrospective

• Problem Team versus Solution Team

• Root Cause versus Proximate Cause


BEHAVIOR: MATTERS

• Intra-team behavior

• Inter-team behavior

• This is not “#NAFB”

• “People in complex systems create safety. … The occasional human contribution to failure occurs because complex systems need an overwhelming human contribution for safety.” (Sidney Dekker)


STRUCTURE: MATTERS


“BLAMELESS” “POSTMORTEMS”?

• Brené Brown, research sociologist, on vulnerability

• “Blame is a way to discharge pain and discomfort”

• Postmortem has a heavy connotation

• “Awesome postmortems?” Really?!


Language and behaviors by stage (Novice, Beginner, Competent, Proficient, Expert)

Novice

• Language: “Incidents are bad; my job is on the line”; “I’m getting sent to the principal’s office because of this outage”

• Behaviors: Completes the post-incident “paperwork”; no formal retrospective/hallway retrospectives

Beginner

• Language: “Let’s fix this as fast as possible”; “What’s the correct fix to avoid this specific issue in the future?”

• Behaviors: Some information (inconsistently) recorded; jump to a focus on why

Competent

• Language: “Let’s review the timeline/incident report to answer that”; “We need to find the root cause of this incident”

• Behaviors: Follows the prescribed format for retrospectives; have and incorporate complete dataset for the incident into the retrospective

Proficient

• Language: “Now that we’ve established what happened, how did it happen?”; “How did these multiple factors influence our complex system?”

• Behaviors: Identifies inherent bias in self and others; perspectives solicited from all involved team members/functional groups

Expert

• Language: “How does our team/system contribute to our successes?”; “What can we incorporate from this incident to better respond next time?”

• Behaviors: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias; retrospective outcomes are fed back into the system and prioritized

RETROSPECTIVES FACILITATE THE SERVICE (AND DEVELOPMENT!) IMPROVEMENT PROCESS


BEING “TOO BUSY” TO LEARN OR IMPROVE MEANS YOU ARE IN A DOWNWARD SPIRAL, BY DEFINITION


IT’S NOT ABOUT THE OUTCOME. IT’S ABOUT THE RESPONSE.


WHY + HOW IS MORE IMPORTANT THAN WHAT


YOU ARE NEVER DONE.


YOU. ARE. NEVER. DONE.


OUR FORECAST FOR THE FUTURE

• Evolving the concept of Service Ownership

• Salesforce-specific Retrospective Guides

• Global “live-site” coaching

• Refocus on getting the business what it wants


AVENUES FOR COLLABORATION

• How does the described Dreyfus model apply in other organizations?

• Would love to hear stories from other enterprises about their retrospective process, who does them, and where they live within the organization


PHOTO CREDITS

• Slide 1: https://en.wikipedia.org/wiki/File:Golden_Fog,_San_Francisco.jpg

• Slide 4: Courtesy Kevina Finn-Braun/Salesforce

• Slide 6: https://www.flickr.com/photos/hannaneh/6464986121

• Slide 7: https://www.youtube.com/watch?v=_DEToXsgrPc#t=1h5m50s

• Slide 13: http://kathmajp.weebly.com/all-movie-reviews/movie-review-twister

• Slide 14: http://thevane.gawker.com/heres-everything-they-got-wrong-and-right-in-the-movi-1609968202

• Slide 15: https://www.flickr.com/photos/ravedelay/17761863929


• Slide 16: Screenshot of aviationweather.gov

• Slide 17: https://www.flickr.com/photos/ravedelay/17534032771/

• Slide 18: https://www.youtube.com/watch?v=8veT5QspylE#t=15m30s

• Slide 19: https://www.flickr.com/photos/jkirkhart35/4984385396

• Slide 20: https://www.youtube.com/watch?v=iCvmsMzlF7o

• Slide 33: https://commons.wikimedia.org/wiki/File:Rainbow_background.jpg

• Slide 35: https://en.wikipedia.org/wiki/File:Clouds_spilling_over_San_Francisco.jpg
