KEVINA FINN-BRAUN, SALESFORCE
J. PAUL REED, RELEASE ENGINEERING APPROACHES
DEVOPS ENTERPRISE SUMMIT, 2015
THE BLAMELESS CLOUD: BRINGING ACTIONABLE RETROSPECTIVES TO SALESFORCE
KEVINA FINN-BRAUN
• Director of Site Reliability Service Management at Salesforce
• Business Continuity at Yahoo
• Geeks out on Group Dynamics and Behavior
• @kfinnbraun on Twitter
• Prepping for the zombie apocalypse
@kfinnbraun @jpaulreed #DOES15
J. PAUL REED
• @jpaulreed on Twitter
• Host of The Ship Show, @shipshowpodcast on Twitter
• Principal Consultant, Release Engineering Approaches
• Spend my days talking to organizations about “The DevOps™”
“SITE RELIABILITY” AT SALESFORCE
• Primary operational team supporting availability
• Acceptance and validation activities
• Develop and implement operational improvements for SFDC
• “Game days”
SERVICE RELIABILITY HURDLES AT SFDC
• Inconsistent application of process, leading to inconsistent information collection
• Incident handling/remediation crossing silo boundaries
• Confusion over service ownership, due to restructured responsibilities
• Disjointed, “heavyweight” meetings
• Postmortems centered around “The Old View” of human error
LANGUAGE OF THE “OLD VIEW”
• “5 whys”
• “Root cause” analysis
• “Why didn’t you[r team]…”
• “You[r team] should have…”
• “Best practices”
THE TIMELINE
• October 2014: First Meeting
• January 2015: “Blow up” HA Forum
• April 2015: Status Check, including assessment shared with senior leaders
• May 2015: Service ownership roles shift
• July 2015: Initial Workshop on “The New View”
• August 2015: Identified first group for coaching
• August 2015 — today: Continued focus and deep-dive on WSRR
• August 2015 — today: Weekly sessions with the initial group
[Flowchart: Root Cause Analysis Workflow]
Goal: Root cause identified five business days from incident resolution.
Steps and decision points shown in the flowchart:
• Start: Incident, Event, Bug
• Initial Analysis
• Decision: RC known?
• Facilitator opens investigations and schedules post mortem meeting
• Request RCA/Failure Analysis
• Decision: RC identified?
• Identify corrective actions and implementation plans; assign actions to scrum teams
• Unable to ascertain root cause; update record with “KE Status”
• Decision: RCM needed?
• RCM Process
• Decision: Impact to Customer/Production or ability to release?
• Tier 3 support communicate RCM to customer(s)
• Engage scrum teams as required
• HA Forum
• Decision: Review @HA?
• Additional work items from HA are assigned
• Decision: Corrective actions complete?
• Weekly meetings to follow up with scrum master on progress
• Update record and set status to “resolved”
• End
HA? Incident Guidelines:
• Severity 0, 1: YES
• Severity 2: Maybe (instance & incident length?)
• Functional Regression: Maybe
• Incorrect/Incomplete Release: YES
• Deployment Delayed or Rolled Back: Maybe
ROOT CAUSE ANALYSIS WORKFLOW
• Designed & implemented two years ago
• Anchored the process around the weekly “HA Forum”
• Intended to apply to all incidents…
• In practice, focused on high profile incidents
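As an illustration only, the “HA? Incident Guidelines” and the five-business-day goal from the workflow could be sketched in Python roughly as follows. The function and parameter names are hypothetical and not from the slides; letting a YES guideline win over a Maybe, returning UNSPECIFIED where the slide gives no guidance, and ignoring holidays are all assumptions.

```python
from datetime import date, timedelta
from enum import Enum


class Review(Enum):
    YES = "yes"
    MAYBE = "maybe"
    UNSPECIFIED = "unspecified"  # the guidelines list no answer for this case


def ha_review_guideline(severity=None, functional_regression=False,
                        bad_release=False, deployment_disrupted=False):
    """Map an incident onto the HA review guidelines (hypothetical helper).

    bad_release          -> "Incorrect/Incomplete Release"
    deployment_disrupted -> "Deployment Delayed or Rolled Back"
    Assumption: any YES guideline takes precedence over a Maybe.
    """
    if severity in (0, 1) or bad_release:
        return Review.YES
    if severity == 2 or functional_regression or deployment_disrupted:
        return Review.MAYBE
    return Review.UNSPECIFIED


def rca_due_date(resolved_on, business_days=5):
    """Date by which root cause should be identified.

    Counts Monday to Friday only; holidays are ignored (assumption).
    """
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due


if __name__ == "__main__":
    print(ha_review_guideline(severity=2))   # Review.MAYBE
    print(rca_due_date(date(2015, 10, 19)))  # 2015-10-26
```

A “Maybe” still needs a human decision (the slide points to instance and incident length for Severity 2), so a sketch like this can only pre-sort incidents, not replace the facilitator’s judgment.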
ROOT CAUSE ANALYSIS WORKFLOW IN REALITY
• Silo transition boundaries evident in the workflow
• Some had little/no contact, via the process, with other teams required to perform their job
• Sampling of incident reports uncovered consistent inconsistencies
• The “Bermuda Blob”
LANGUAGE: MATTERS
• “HA Forum” ➡ “WSRR”
• “WAR” (What is it good for?)
• Postmortem versus Retrospective
• Problem Team versus Solution Team
• Root Cause versus Proximate Cause
BEHAVIOR: MATTERS
• Intra-team behavior
• Inter-team behavior
• This is not “#NAFB”
• “People in complex systems create safety. … The occasional human contribution to failure occurs because complex systems need an overwhelming human contribution for safety.” (Sidney Dekker)
“BLAMELESS” “POSTMORTEMS”?
• Brené Brown, research sociologist, on vulnerability
• “Blame is a way to discharge pain and discomfort”
• Postmortem has a heavy connotation
• “Awesome postmortems?” Really?!
Novice
• Language: “Incidents are bad; my job is on the line”; “I’m getting sent to the principal’s office because of this outage”
• Behaviors: Completes the post-incident “paperwork”; no formal retrospective/hallway retrospectives
Beginner
• Language: “Let’s fix this as fast as possible”; “What’s the correct fix to avoid this specific issue in the future?”
• Behaviors: Some information (inconsistently) recorded; jump to a focus on why
Competent
• Language: “Let’s review the timeline/incident report to answer that”; “We need to find the root cause of this incident”
• Behaviors: Follows the prescribed format for retrospectives; have and incorporate complete dataset for the incident into the retrospective
Proficient
• Language: “Now that we’ve established what happened, how did it happen?”; “How did these multiple factors influence our complex system?”
• Behaviors: Identifies inherent bias in self and others; perspectives solicited from all involved team members/functional groups
Expert
• Language: “How does our team/system contribute to our successes?”; “What can we incorporate from this incident to better respond next time?”
• Behaviors: Able to facilitate retrospectives by healthily helping others address tendency to blame/personal & systemic bias; retrospective outcomes are fed back into the system and prioritized
RETROSPECTIVES FACILITATE THE SERVICE (AND DEVELOPMENT!) IMPROVEMENT PROCESS
BEING “TOO BUSY” TO LEARN OR IMPROVE MEANS YOU ARE IN A DOWNWARD SPIRAL, BY DEFINITION
IT’S NOT ABOUT THE OUTCOME. IT’S ABOUT THE RESPONSE.
OUR FORECAST FOR THE FUTURE
• Evolving the concept of Service Ownership
• Salesforce-specific Retrospective Guides
• Global “live-site” coaching
• Refocus on getting the business what it wants
AVENUES FOR COLLABORATION
• How does the described Dreyfus model apply in other organizations?
• Would love to hear stories from other enterprises about their retrospective process, who does them, and where they live within the organization
Kevina Finn-Braun http://lnkdin.me/kevinafinnbraun
J. Paul Reed http://jpaulreed.com
PHOTO CREDITS
• Slide 1: https://en.wikipedia.org/wiki/File:Golden_Fog,_San_Francisco.jpg
• Slide 4: Courtesy Kevina Finn-Braun/Salesforce
• Slide 6: https://www.flickr.com/photos/hannaneh/6464986121
• Slide 7: https://www.youtube.com/watch?v=_DEToXsgrPc#t=1h5m50s
• Slide 13: http://kathmajp.weebly.com/all-movie-reviews/movie-review-twister
• Slide 14: http://thevane.gawker.com/heres-everything-they-got-wrong-and-right-in-the-movi-1609968202
• Slide 15: https://www.flickr.com/photos/ravedelay/17761863929
• Slide 16: Screenshot of aviationweather.gov
• Slide 17: https://www.flickr.com/photos/ravedelay/17534032771/
• Slide 18: https://www.youtube.com/watch?v=8veT5QspylE#t=15m30s
• Slide 19: https://www.flickr.com/photos/jkirkhart35/4984385396
• Slide 20: https://www.youtube.com/watch?v=iCvmsMzlF7o
• Slide 33: https://commons.wikimedia.org/wiki/File:Rainbow_background.jpg
• Slide 35: https://en.wikipedia.org/wiki/File:Clouds_spilling_over_San_Francisco.jpg