Copyright © 2008 Siemens Medical Solutions USA, Inc. All rights reserved.
Say Goodbye to Post Mortems
Say Hello to Effective Problem Management
Charles T. Foy
Siemens Medical Solutions USA, Inc.
Health Services
[email protected]
Company: Siemens AG
Our division: healthcare software
Our department: application hosting
  Mainframe, mid-range, open systems, distributed systems
  All operating systems (except Tandem)
My role
Caution!
This company was founded by former employees of International Business Machines (IBM)
Proclivity for acronyms is part of the culture.
Proclivity: “a natural or habitual inclination or tendency; propensity; predisposition”
You have been warned…
Agenda
What drove creation of a Problem Management System?
First steps
Give it a name?
Got Lucky!
Build versus Buy
It’s a Defect!
What to track?
Classifications?
Database Structure
The Process
Trending
Benefits
What drove creation of a Problem Management System?
Disparate, inconsistent ‘post-mortems’
  Usually driven by customer demand for an explanation
Needed a defined process
  Consistent across the company
  Communicates to the customer, internal and external
First Steps
Launch:
Assigned to a small group
  Two service delivery managers
  One consultant (employee #26)
  Quality Assurance and Process Definition expert
No detailed marching orders other than “standard post-mortem process”
First Steps
Started with…
[Diagram: a standardized text document; then the document with a Root Cause section; then with Root Cause and Follow-up sections; finally a document database with a Root Cause field and a Follow-up field]
First Steps
Defined our own goal:
Redefined project outcomes:
  reduces unscheduled outages
  increases availability
  communicates the root cause and the preventive measures implemented to internal and external audiences
Has to:
  Drive to the root cause
  In a searchable manner, track:
    outage details
    root causes, corrective actions
    customer communications
    preventive measures
    implementation status
    etc.
First Steps
Give it a Name?
Needed a new name
  No longer a “Post Mortem” process
  “Post Mortem” didn’t sit well
  Before fully ITIL-aware
How about a working title for our project?
  Perhaps the Post Event Analysis Process, a.k.a. PEAP?
  Can always change it later on
Never Happened!
Thus, PEAP was born!
And if the Post Event Analysis Process produces a Report,
it of course would be called…
First Steps
Post Mortem Report new name:
The Post Event Analysis Report
Or
PEAR
Define the database and process
Database needs:
1. Description, short term resolution, root cause
2. Customers impacted, length of outage
3. Corrective actions implemented & their status
4. Etc.
Process:
1. Capture the root cause
2. Ensure the corrective action was implemented
3. Communicate all the above
Seemed straightforward, linear, one to one…
Next Steps: define the database requirements
We Got Lucky!
Ran into a friend…
  Provided us with an excellent service outage to use as our model
  Decided to use it as proof of concept
Slowdown affecting almost all his applications
  Response time dropped to zero within 5 minutes…
Started looking like it was the Storage Area Network (SAN)
Started looking for commonalities: network was suspect
  A Configuration Management Database (CMDB) would have helped!
Problem cleared up 45 minutes into the event
The Outage Incident
Look up: Jake, SAN Technician
Fixes the problem! Not!
Battery swap! 45 minutes ago, looks good!
Here’s what happened…
Root cause:
Battery was going to go bad and was swapped out.
So Hardware is the root cause
But wait… is it really a Hardware issue?
  Battery didn’t actually die… it was Jake, the SAN Technician!
Human Error!
But wait… is it really a “Human Error” issue?
  Jake was doing his job
OK, a… “Rules” issue: “always swap batteries off peak”
Root cause?
Aren’t these ‘contributing’ root causes?
  They didn’t know the battery was alerting
  SAN vendor knew
  SAN technician walked in and worked without their knowledge
  SAN technician education
  Data center employees education
  No battery swap rule/process
Root cause?
What would we put as our root cause?
Do we need to track all these ‘root’ causes?
Do we need to track the corrective actions for each?
Don’t most outages have multiple root causes?
Conclusion: MULTIPLE root causes
Multiple root causes, multiple follow-ups.
This would be complex.
Build a database?
Designed requirements, got a resource time estimate
Presented to upper management
Anything on the shelf?
Tools and Methodology Manager:
  Essentially, you’re tracking defects!
• Hardware that breaks
• Software that breaks
• Humans that make errors…
Defect Tracking
Company standard defect tracking application
  Fully implemented and operational
Subject Matter Expert (SME)
  Does 90% of what you need
  Easy to implement
  What are your major defect categories?
Defect Tracking
To build this, you need Classifications…
What are your major defect areas?
How granular?
The Classifications
Asked our peers
Specific type of hardware
Specific type of software
Human error
The Classifications
How much detail?
Major category (hardware)
The thing that broke (server)
Thing that caused it to break (bad power supply)
Model that broke (Fleetwood XL340)
Human Error
Does that work for Human Error?
Example: Jeff mistyped a static route in a backup router. Primary router fails. Backup router kicks in but does not recover all the interfaces…
Major category (human error)
The thing that broke (typing)
Thing that caused it to break (not enough sleep)
Model that broke (Jeff)
Human Error?
Do we really want to say “human error”?
What does it mean to make a human error?
Failure To Follow A Process?
…FTFAP
Eureka! A five letter acronym!
Classifications
Euphemism at first, then…
The “Process” category was born!
  Process Not Followed (a.k.a. Human Error)
  Process Incomplete
  Process Incorrect (covers the “need to change the Rule” root cause)
  Documentation wrong
More items to track
Version and vendor of the software/hardware?
Name of the Human?
Impacted application(s)?
Impacted customer(s)?
O/S level? 3rd party software? Something we wrote?
Was this tested before it was put into production?
Did it happen before?
What is the air-speed velocity of an unladen swallow?
Database Structure
Supports Multiple Levels of Classification
Global Keyword: allows for over-all groupings
  1. Hardware
  2. Software
  3. Process
Keyword 1 answers “What broke?” Answer: Server
Keyword 2 answers “What thing within KW1 broke?” Answer: Power Supply
Keyword Grouping Samples
Hardware

Keyword 1    Keyword 2
Router       Chassis
             Memory
             NIC Card
             NPE
             Pwr Supply
Server       Cable
             CPU
             Hard Drive
             HBA
             Memory
             MthrBoard
             Pwr Supply
Keyword Grouping Samples
Software

Keyword 1       Keyword 2
Application A   Print Subsys
                GSM
                RSA
                Service Pack
CICS            Configuration
                Dayend Flow
                MODS
                PTF
Server          BIOS
                Term Svcs
                DHCP
                Firewall
                IIS
                LDAP
                Virus-Wm
Keyword Grouping Samples
Process
Keyword 1 Keyword2
Process Incomplete
Process Incorrect
Process Not Follow
Documentation Incorrect
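As a rough illustration of the multi-level classification on these slides, the sketch below stores a few of the sample keywords in a nested lookup and rejects combinations that are not in it. The names `TAXONOMY`, `Classification`, and `validate` are hypothetical, and treating the Process entries as Keyword 1 values with no Keyword 2 is an assumption; the slides do not show how the defect-tracking tool actually stores keywords.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup: global keyword -> keyword 1 -> allowed keyword 2 values.
# Entries are a subset of the sample keywords shown on the slides.
TAXONOMY = {
    "Hardware": {
        "Router": {"Chassis", "Memory", "NIC Card", "NPE", "Pwr Supply"},
        "Server": {"Cable", "CPU", "Hard Drive", "HBA", "Memory",
                   "MthrBoard", "Pwr Supply"},
    },
    # Assumption: process classifications stop at keyword 1.
    "Process": {
        "Process Incomplete": set(),
        "Process Incorrect": set(),
        "Process Not Followed": set(),
    },
}


@dataclass(frozen=True)
class Classification:
    global_keyword: str
    keyword1: str                   # answers "What broke?"
    keyword2: Optional[str] = None  # answers "What thing within KW1 broke?"


def validate(c: Classification) -> None:
    """Raise ValueError if the keyword combination is not in TAXONOMY."""
    level1 = TAXONOMY.get(c.global_keyword)
    if level1 is None:
        raise ValueError(f"unknown global keyword: {c.global_keyword}")
    level2 = level1.get(c.keyword1)
    if level2 is None:
        raise ValueError(f"unknown keyword 1: {c.keyword1}")
    if c.keyword2 is not None and c.keyword2 not in level2:
        raise ValueError(f"unknown keyword 2: {c.keyword2}")


validate(Classification("Hardware", "Server", "Pwr Supply"))  # accepted
validate(Classification("Process", "Process Incorrect"))      # accepted
```

Keeping the allowed values in one lookup is one way to stop free-text variants of the same keyword from fragmenting later trend reports.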
Database Structure
Primary Root Cause
Contributing Root Cause 1
Contributing Root Cause 2
Contributing Root Cause 3
Contributing Root Cause 4
Database Structure
All root causes and keywords

Root causes:
  Battery Maintenance During Prime Time
  Flaky Battery
  SAN Technician working without their knowledge
  Vendor allowed in data center without escort
  No battery preventive maintenance schedule
Keywords:
  Hardware, SAN, Battery
  Process, Process Incorrect
  Process, Process Not Followed
  Process, Process Incomplete
  Process, Process Incorrect
Database Structure
Service Outage View: all root causes and follow-ups

Primary Root Cause          Preventive Action Item
Contributing Root Cause 1   Preventive Action Item
Contributing Root Cause 2   Preventive Action Item
Contributing Root Cause 3   Preventive Action Item
Contributing Root Cause 4   Preventive Action Item
Service Outage View
Battery Maintenance During Prime Time
  Change Battery Maintenance Process: swap is done off-peak
Bad Battery
  Install second battery
SAN Technician working without their knowledge
  Change process: require SAN Technician to get permission from SAN group for all work
Vendor allowed in data center without escort
  Change security process: no unescorted vendors
No battery preventive maintenance schedule
  Create process to replace batteries every x months, well in advance of MTBF
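The Service Outage View can be sketched as one outage record holding several root causes, each with its own preventive actions. This is only a minimal illustration: the class and field names are hypothetical, the outage title is invented, and the pairing of each cause with its action is inferred from the wording of the slide rather than stated by it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PreventiveAction:
    description: str
    implemented: bool = False


@dataclass
class RootCause:
    description: str
    is_primary: bool = False
    actions: List[PreventiveAction] = field(default_factory=list)


@dataclass
class ServiceOutage:
    title: str
    root_causes: List[RootCause] = field(default_factory=list)

    def primary(self) -> RootCause:
        """Return the primary root cause; raise unless there is exactly one."""
        primaries = [rc for rc in self.root_causes if rc.is_primary]
        if len(primaries) != 1:
            raise ValueError("expected exactly one primary root cause")
        return primaries[0]

    def open_actions(self) -> List[PreventiveAction]:
        """Preventive actions not yet implemented, across all root causes."""
        return [a for rc in self.root_causes for a in rc.actions
                if not a.implemented]


# Three of the five causes from the battery example (pairing inferred).
outage = ServiceOutage("SAN slowdown", [
    RootCause("Battery maintenance during prime time", True,
              [PreventiveAction("Swap batteries off-peak")]),
    RootCause("Bad battery", False,
              [PreventiveAction("Install second battery")]),
    RootCause("Vendor allowed in data center without escort", False,
              [PreventiveAction("No unescorted vendors")]),
])
print(outage.primary().description)
print(len(outage.open_actions()))
```

Modeling root causes as a list rather than a single field reflects the slide's conclusion that one outage has multiple root causes and multiple follow-ups.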
The Process
Who will own the process?
Owner? PEAP Owner role? (PO?)
We need action in the title… PEAP Driver (PD?)
How about a PEAP Owner/Driver? A POD!
The Process
POD role
[Service Outage View for the battery outage, repeated, with two POD tasks called out: ID all root causes; Describe Preventive Action]
The Process
Assign follow ups…
Service Outage View
Battery Maintenance During Prime Time
  Change Battery Maintenance Process: swap is done off-peak
    Assign to Manager of SAN Group
Flaky Battery
  Install second battery
    Assign to Manager of SAN Group
SAN Technician working without knowledge
  Change process: require SAN Technician to get permission from SAN group for all work
    Assign to Manager of SAN Group
Vendor allowed in data center without escort
  Change security process: no unescorted vendors
    Assign to Building Security Manager
No battery preventive maintenance schedule
  Create process to replace batteries every x months, well in advance of MTBF
    Assign to Manager of SAN Group, drive process internal and external
The Process
Document and Communicate
Document all in the database
Communicate:
  Internally
  Externally
Drive the process to completion
Surprisingly, nobody wants to be a POD!
Actually a good thing…
If your area contributed or caused an outage, you get to be POD.
Incentive not to have outages
Battery Maintenance During Prime Time
Not aware of potential bad battery
SAN Technician working without knowledge
Vendor allowed in data center without escort
No battery preventive maintenance schedule
The Process - details to work out
How to define an outage?
When is the outage over?
Who is best to drive this process?
How does the process get initiated?
The Process - details
Eureka!
Outage Manager can launch PEAP
  Assign POD = manager of group that fixed the outage
Existing Outage Management Process
  Existing outage definition
  Knowledge of incident
  Communicates incident status to customers
The Process
ITIL Terminology:
  Incident: Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.
  Problem: unknown underlying cause of one or more incidents.
  - from ITIL Foundations by ITpreneurs B.V. 2006
At the end of the Incident Management process, the item is moved to the Problem Management Process
The Process
Incident to Problem Transition
Outage incident:
  Details in incident tracking system
  Outage resolved, incident ticket is solved
Interface to PEAP (Problem Management system):
  Details transferred to a defect record
  Defect assigned to an owner: the POD
  Updates to defect record pass back to incident ticket
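The hand-off from incident ticket to defect record might look roughly like the sketch below. The slides name no tool or API, so `IncidentTicket`, `DefectRecord`, `open_defect`, and the ticket ID are hypothetical stand-ins; the only behavior taken from the slide is that details are copied over, the defect is assigned to the POD, and updates flow back to the ticket.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentTicket:
    ticket_id: str
    details: str
    solved: bool = False
    notes: List[str] = field(default_factory=list)


@dataclass
class DefectRecord:
    incident: IncidentTicket
    details: str
    owner: str  # the POD
    updates: List[str] = field(default_factory=list)

    def add_update(self, text: str) -> None:
        """Record an update and pass it back to the incident ticket."""
        self.updates.append(text)
        self.incident.notes.append(f"PEAP update: {text}")


def open_defect(incident: IncidentTicket, pod: str) -> DefectRecord:
    """Create a defect record from a solved incident and assign it to the POD."""
    if not incident.solved:
        raise ValueError("incident must be solved before it moves to PEAP")
    return DefectRecord(incident=incident, details=incident.details, owner=pod)


ticket = IncidentTicket("INC-1", "SAN slowdown, 45 minutes", solved=True)
defect = open_defect(ticket, pod="Manager of SAN Group")
defect.add_update("Root causes identified")
print(ticket.notes)
```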
Post Event Analysis Process in a nutshell
Outage ends, incident transferred to PEAP
PEAP assigned to a manager (POD)
POD notified automatically by e-mail
POD:
  gathers information, determines root causes
  enters findings in database and internal post-mortem (PEAR)
  assigns follow-ups as needed (new records created)
PEAR sent internally
Customer Letter is created, reviewed, sent to affected customers
Corrective actions implemented
PEAR reviewed by senior management
PEAP solved
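The steps above can be sketched as a simple ordered workflow. The step names are paraphrased from the slide, `Step` and `Peap` are hypothetical names, and the strictly linear ordering is an assumption: the slide lists the steps in this order but does not say whether some may overlap.

```python
from enum import IntEnum


class Step(IntEnum):
    TRANSFERRED = 1           # outage ends, incident transferred to PEAP
    POD_ASSIGNED = 2          # PEAP assigned to a manager (POD)
    POD_NOTIFIED = 3          # automatic e-mail
    FINDINGS_ENTERED = 4      # root causes entered in database and PEAR
    FOLLOW_UPS_ASSIGNED = 5   # new records created as needed
    PEAR_SENT = 6             # internal distribution
    CUSTOMER_LETTER_SENT = 7
    ACTIONS_IMPLEMENTED = 8
    MANAGEMENT_REVIEWED = 9
    SOLVED = 10


class Peap:
    """Tracks where one post event analysis stands in the workflow."""

    def __init__(self) -> None:
        self.step = Step.TRANSFERRED

    def advance(self) -> Step:
        """Move to the next step; refuse to go past SOLVED."""
        if self.step is Step.SOLVED:
            raise RuntimeError("PEAP is already solved")
        self.step = Step(self.step + 1)
        return self.step

    def remaining(self) -> int:
        """Number of steps left before the PEAP is solved."""
        return int(Step.SOLVED - self.step)


peap = Peap()
peap.advance()
print(peap.step.name, peap.remaining())
```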
Post Event Analysis Process - Rollout
Collateral
  PEAR template
  Customer Letter templates
  Process user guide: database navigation, process steps
Education class
  Overview of root cause determination
  Overview of process
  Navigating the database
Post Event Analysis Process - Rollout
Limited scope initially
  Multi-customer outages > 15 min
  All multi-customer outages
  All outages
Quality Management System: central location
  Process description
  User Guide
  PEAR and Customer Letter templates
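The widening scope of the rollout can be expressed as a small rule. This is a sketch only: numbering the three scopes as phases 1 to 3, the function name `requires_peap`, and reading "multi-customer" as two or more affected customers are all assumptions.

```python
def requires_peap(customers_affected: int, minutes: float, phase: int) -> bool:
    """Decide whether an outage falls in scope for the given rollout phase."""
    multi_customer = customers_affected >= 2  # assumed meaning of "multi-customer"
    if phase == 1:
        return multi_customer and minutes > 15  # multi-customer outages > 15 min
    if phase == 2:
        return multi_customer                   # all multi-customer outages
    if phase == 3:
        return True                             # all outages
    raise ValueError(f"unknown rollout phase: {phase}")


print(requires_peap(customers_affected=3, minutes=10, phase=1))
print(requires_peap(customers_affected=3, minutes=10, phase=2))
```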
Post Event Analysis Process - Rollout
Challenges:
1. Process defined in QMS but lengthy
  “Checklist” with links to QMS section
2. Original process: too many steps
  Gathered feedback
  Reduced the number of steps
  Second round of education: new process, value
Post Event Analysis Process - Rollout
Challenges:
3. Culture change: gaps in compliance
  Phased roll-out
  Re-education
  Administrative reminders
  Senior management support
4. Not all root causes identified
  Weekly reviews with senior management
  “5 Whys”
Benefits
1. Decreased downtime
  “one-fers” (that aren’t) are identified
    across platforms
    across time spans
2. Increased customer satisfaction
  Many “customers” of ours are CIOs or IT staff
    Explain to their own customers
    Knowledge of cause and remediation
Benefits
3. Adjust monitoring focus
  Identify gaps: component level
  Identify gaps: end-user experience
4. Fewer outages due to late running maintenance
  More precise estimates, smaller scopes
  Avoid effort to complete PEAP
Trending
[Diagram sequence: the Primary Root Cause and Contributing Root Causes 1 to 4 of one outage each become a Root Cause record; across many outages these records accumulate into a growing pool of Root Cause records]
Trending
Root Cause
  Corrective Actions
  Major Category, Keyword 1, Keyword 2
  Part that failed, Vendor
  Customers Impacted, Duration
  Applications Impacted
Trending
What can you discover?
Outage Type by Percentage (not actual data)
  Hardware 27%
  Software 31%
  Process 42%
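A breakdown like this one can be computed from the global keyword stored on each root-cause record. The sketch below is illustrative only: `share_by_category` is a hypothetical helper and the input list is made-up data, not figures from the presentation.

```python
from collections import Counter
from typing import Dict, Iterable


def share_by_category(categories: Iterable[str]) -> Dict[str, float]:
    """Return each category's share of all records, as a percentage."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Hypothetical global keywords, one per root-cause record.
records = ["Hardware", "Process", "Process", "Software"]
print(share_by_category(records))
```

The same helper works for Keyword 1 or Keyword 2 values, which is how a finer breakdown (for example, by process sub-category) could be produced.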
Trending
Process Issues Time Percentage (not actual data)

Category             Hours   Percentage
Process-Incorrect    146     55%
Process-Incomplete   63      23%
Process-Not Follow   60      22%
Trending
[Bar chart: Hardware Outage Minutes by Month (NOT ACTUAL DATA); x-axis: months from 2006-Jan to 2007-Dec; y-axis: Minutes, 0 to 80]
Trending
[Bar chart: Outages by Weekday (NOT ACTUAL DATA); x-axis: Sun to Sat; y-axis: Count, 0 to 30; data labels as extracted: 4, 27, 28, 28, 18, 21, 9]
Trending
[Stacked bar chart: Outages by Weekday (NOT ACTUAL DATA), split into Software, Process, and Hardware; x-axis: Sun to Sat; y-axis: Count, 0 to 30]
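A weekday-by-category count of this kind can be derived from outage dates and global keywords. The following is a minimal sketch with made-up records; `outages_by_weekday` and `DAYS` are hypothetical names, and nothing here reproduces the chart's values.

```python
from collections import Counter
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # date.weekday() order


def outages_by_weekday(records):
    """Count outages per (weekday, category) from (date, category) pairs."""
    counts = Counter()
    for day, category in records:
        counts[(DAYS[day.weekday()], category)] += 1
    return counts


# Hypothetical records, not actual data.
records = [
    (date(2007, 1, 1), "Process"),   # a Monday
    (date(2007, 1, 1), "Hardware"),
    (date(2007, 1, 8), "Process"),   # the following Monday
    (date(2007, 1, 6), "Software"),  # a Saturday
]
counts = outages_by_weekday(records)
print(counts[("Mon", "Process")])
```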
Trending
Application A Downtime Minutes (not actual data)
  Network-Loose Electrical Plug, 101
  Process-Incorrect, 98
  DB2, 85
  W2K Server OS-HIS, 360
  Mainframe-SNA Server, 75
  M/F-FEP, 140
  Process-Incomplete, 30
  Server-undetermined outage, 30
  Causes < 120 mins, 108
  Mainframe-TELNET, 180
  Circuit Breaker-UPS, 69
  Network-Switch, 69
  Process-Not Follow, 39
  W2K3 OS-HIS, 135
  Windows Server OS-IIS, 252
  Network-Circuit, 292
Conclusion
Methodology: works for all size IT shops
Robust defect-tracking database
  critical for large shops
  smaller scale: standard document, keywords
No group per landscape: someone is responsible
Integration of PEAP into workflows
Phased roll-out, repeat education
Admin to assist with tracking and notifications
Problem manager to ‘own’ the process? Recommended
How to categorize keywords: ongoing refinements