SESSION 306 Wednesday, November 1, 3:00pm - 4:00pm
Track: Improving Service Management
Problem Management: A Practical Guide
Buff Scott III Principal Consultant, Propoint Solutions, Inc. [email protected]
Session Description
Problem management is one of two ITSM processes referred to as service resolution and restoration processes; the other is incident management. While incident management is focused on restoring normal IT service operation as quickly as possible, problem management focuses on determining the root cause of one or more incidents, identifying temporary workarounds, and applying permanent fixes so that incidents (and service disruptions) don’t happen again. Join this session and take home a practical approach to problem management for your organization. Let’s transition from "firefighting" to "fire prevention"!
Speaker Background
Buff Scott III has more than 35 years of experience in the IT industry. He’s a versatile leader with extensive management experience, and he’s an accredited ITIL v3 Expert, ITIL Trainer, and HDI Faculty member. Buff also holds the Certified Information Systems Auditor (CISA) certification and is an International Best Practice co-author of "Problem Management: A Practical Guide". He has presented at numerous local and national IT service management conferences and forums.
Welcome!
Buff Scott [email protected]
• ITIL Expert
• EXIN Accredited ITIL Trainer
• International Best Practice Author – “Problem Management: A Practical Guide”
• Certified Information Systems Auditor
• TIPA Lead Process Assessor
Service Restoration
Duct Tape Isn’t The Answer For Everything!!
Incident Management
• More than restoring services
• Characteristic of high-performing IT organizations
• Eliminate recurring incidents
• Prevent incidents from occurring
• Minimize the impact of incidents and problems when they cannot be prevented
• Logs data used for trending by Problem Management
• Categorizes incidents which aids in appropriate incident and problem assignments
• Prioritizes incidents which triggers problem prioritization
• Links incidents to problems
Incident – An unplanned interruption to the standard operation of a service, or a reduction in the quality of that service
Problem – The underlying cause of one or more incidents
Problem Management vs. Incident Management
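The two definitions above, and the point that Incident Management links incidents to problems, can be sketched as a small data model. This is only an illustration: the class names, field names, and the rule that a problem inherits the highest priority of its linked incidents are assumptions, not something the slides prescribe.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """An unplanned interruption to, or reduction in quality of, a service."""
    incident_id: str
    category: str                     # categorization aids assignment
    priority: int                     # 1 = highest (hypothetical scale)
    problem_id: Optional[str] = None  # link to the underlying problem, if any


@dataclass
class Problem:
    """The underlying cause of one or more incidents."""
    problem_id: str
    category: str
    priority: int
    incident_ids: List[str] = field(default_factory=list)


def link_incident(incident: Incident, problem: Problem) -> None:
    """Link an incident to a problem (idempotent) and update problem priority."""
    incident.problem_id = problem.problem_id
    if incident.incident_id not in problem.incident_ids:
        problem.incident_ids.append(incident.incident_id)
    # Assumed rule: a lower number means higher priority, and the problem
    # takes the highest priority among its linked incidents.
    problem.priority = min(problem.priority, incident.priority)
```

Keeping the link on both records lets the Service Desk find the problem from an incident, and lets a Problem Analyst list every incident a problem has caused.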
Problem Management – Permanent Solutions
[Diagram. Labels: Problem Management, Change Management, Incident Database, Matching, Problem DB, KEDB or KB, Problem Record, Root Cause, Workaround, Known Error, CI at fault, RFC, Change / Release]
Problem Management Scope
• Reactive Problem Management is focused on solving problems in response to one or more incidents as they occur
• Proactive Problem Management is focused on identifying and solving problems and known errors that might otherwise be missed, thereby preventing future incidents
• Detection and categorization
o Those activities focused on identifying, logging, and classifying problems
• Investigation and diagnosis
o Those activities focused on identifying root cause and transforming problems into known errors
• Resolution and recovery
o Those activities focused on identifying, approving, applying, and validating permanent fixes to problems and known errors
• Closure
o Those activities focused on closing problems, known errors and related incidents with updated and reusable information
Problem Management Activities
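One way to read the four activity groups is as ordered phases of a problem record. The sketch below is a minimal illustration under stated assumptions: the names `Phase` and `ProblemRecord` are hypothetical, and so is the rule that a record becomes a known error once a root cause is recorded and cannot leave investigation before then.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    DETECTION_AND_CATEGORIZATION = 1
    INVESTIGATION_AND_DIAGNOSIS = 2
    RESOLUTION_AND_RECOVERY = 3
    CLOSURE = 4


@dataclass
class ProblemRecord:
    problem_id: str
    phase: Phase = Phase.DETECTION_AND_CATEGORIZATION
    root_cause: Optional[str] = None

    @property
    def is_known_error(self) -> bool:
        # Assumption: a problem counts as a known error once its
        # root cause has been recorded.
        return self.root_cause is not None

    def advance(self) -> Phase:
        """Move to the next phase, enforcing the assumed entry rules."""
        if self.phase is Phase.CLOSURE:
            raise ValueError("problem is already in closure")
        if (self.phase is Phase.INVESTIGATION_AND_DIAGNOSIS
                and not self.is_known_error):
            raise ValueError("record a root cause before resolution")
        self.phase = Phase(self.phase + 1)
        return self.phase
```

A real tool would likely allow moving backwards (for example, when a fix fails verification); the sketch keeps only the forward path described on the slide.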
Triggers for opening a problem record
• There is an incident for which the root cause is not known
• Analysis of an incident by a Support Group reveals a potential underlying problem
• Event and alerting tools automatically create an incident record due to fault detection. This may reveal the need for a problem record.
• A major incident was declared
Reactive Problem Management
• Analysis of incidents over differing time periods reveals a recurring trend, indicating an underlying problem might exist
• Analysis of the IT infrastructure by Support Groups identifies a potential problem
• Analysis results from data mining of the knowledgebase
• Reports generated from application or system software
Proactive Problem Management
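The first proactive trigger, a recurring trend across a time period, can be approximated by counting incidents per category inside a window. This is a rough sketch only: the function name, the 30-day window, and the threshold of three incidents are illustrative assumptions that each organization would tune to its own data.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def recurring_categories(
    incidents: Iterable[Tuple[date, str]],
    as_of: date,
    window_days: int = 30,
    threshold: int = 3,
) -> List[str]:
    """Return categories with at least `threshold` incidents in the window.

    `incidents` is an iterable of (opened_on, category) pairs. The window
    covers the `window_days` days up to and including `as_of`.
    """
    start = as_of - timedelta(days=window_days)
    counts = Counter(
        category
        for opened_on, category in incidents
        if start < opened_on <= as_of
    )
    return sorted(cat for cat, n in counts.items() if n >= threshold)
```

Running the same function with several values of `window_days` mirrors the slide's point about analyzing "differing time periods": a category that looks quiet over 30 days may still recur over a quarter.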
Investigation and Diagnosis
• Define Problem
• Document any Workaround
• Collect Data
• Analyze Data
• Perform Root Cause Analysis
• Document Conclusion
• Determine what happened
• Determine why it happened (understand causal factors)
• Identify and document a workaround
• Determine the root cause
Ishikawa Diagrams
Kepner & Tregoe
Pareto Analysis
Fault Tree Analysis
The four major classifications of root causes:
• Physical causes – components failed
• System errors – software failed
• Human causes – people did something wrong or failed to do something they should have
• Organizational causes – a process, policy, or procedure is in error
Investigation and Diagnosis
Common Root Cause Analysis (RCA) Techniques:
• Brainstorming
• Five Whys
• Chronological Analysis
• Ishikawa Diagrams
• Pareto Analysis
• Kepner-Tregoe
• Fault Tree Analysis
Investigation and Diagnosis
• Focus initially on major incidents or priority 1 incidents
• Identify RCA team based on customer, service or category
• Start with a timeline (chronological analysis)
• Brainstorm and identify all possible causes
• Use Pareto Analysis when data is available to identify the most likely causes
• Post your work for others to see/use
Root Cause Analysis
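Where cause data is available, the Pareto step can be sketched as follows: count each cause, sort by frequency, and keep the few causes that together account for most occurrences. The function name and the 80% cutoff are assumptions for illustration; the slides do not specify a threshold.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(causes: Iterable[str], cutoff: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the most frequent causes that together reach `cutoff`.

    Each row is (cause, count, cumulative share of all occurrences).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows: List[Tuple[str, int, float]] = []
    running = 0
    # Sort by descending count, then by name so ties are deterministic.
    for cause, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += n
        rows.append((cause, n, running / total))
        if running / total >= cutoff:
            break
    return rows
```

The returned rows are the causes worth investigating first; everything after the cutoff is the long tail that an RCA team can usually defer.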
• Research and identify possible solutions
• Choose a solution
• Test the proposed solution
• Submit a Request For Change (RFC) to Change Management for approval to implement the proposed solution
Resolution & Recovery
• Implement the proposed solution
• Verify the solution corrected the error
• Execute problem prevention activities
• Update the KEDB or knowledge base with resolution information
• Verify that the Problem and Known Error records are updated, correct and complete
• Close the Problem or Known Error records when the change has been implemented and the solution verified (there are no new Incidents related to the Problem)
• Update the status of related open Incidents at the time of Problem and Known Error record closure
• Conduct a post-implementation review for capturing lessons learned to be applied to future Problems
Closure
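The closure conditions listed above lend themselves to a simple checklist function. The sketch below is illustrative only: `ClosureCheck`, its fields, and the blocker messages are hypothetical names, and a real ITSM tool would draw these facts from its own records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClosureCheck:
    """Facts gathered before closing a Problem or Known Error record."""
    change_implemented: bool
    solution_verified: bool
    new_related_incidents: int  # incidents linked since the fix went in
    kedb_updated: bool
    open_incident_ids: List[str] = field(default_factory=list)


def closure_blockers(check: ClosureCheck) -> List[str]:
    """Return the reasons the record cannot be closed yet (empty if none)."""
    blockers = []
    if not check.change_implemented:
        blockers.append("change not implemented")
    if not check.solution_verified:
        blockers.append("solution not verified")
    if check.new_related_incidents > 0:
        blockers.append("new related incidents reported")
    if not check.kedb_updated:
        blockers.append("KEDB or knowledge base not updated")
    return blockers


def close_record(check: ClosureCheck) -> List[str]:
    """Close the record and return the related open incidents to update."""
    blockers = closure_blockers(check)
    if blockers:
        raise ValueError("cannot close: " + "; ".join(blockers))
    return list(check.open_incident_ids)
```

Returning the list of open incidents reflects the step of updating related incidents at the time the Problem or Known Error record is closed.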
Problem Management
Inputs
• Incident records, CMDB info, Knowledgebase (KEDB), reports, monitoring tool logs, release issues, risk analysis output
Outputs
• Workarounds, Known Errors, RFCs, Permanent fixes, Closed Problems & Incidents, Reports
Major Activities
• Detection and Classification
o Identification and Recording
o Classification and Resource Allocation
• Investigation & Diagnosis
o Investigate and Diagnose
• Resolution & Recovery
o Solution Identification
o Solution Implementation
• Closure
o Problem and Error Closure
Task
• Determine if an existing KE or Problem exists
• Assign Problem
• Open a Problem record
• Identify required staffing skill set
• Determine & validate workarounds
• Update Status and Priority
• Investigate Problem
• Match Incidents and link to KE or Problem record
• Categorize Problem
• Find Root Cause and identify CI at fault
• Document workaround in Problem record
• Hold Post Implementation Review as Needed
• Determine action to resolve Known Error record
• Test solution
• Update Known Error or Problem Record with solution
• Submit RFC, if needed
• Implement permanent fix
• Develop permanent fix
• Close Known Error or Problem record
• Verify error was corrected
• Notify Service Desk to close Incidents
• Validate records are complete
Roles and Responsibilities
The primary roles involved in Problem Management are:
• Problem Analyst – members of Support Groups who are assigned Problems
• Process Owner – owns and maintains the Problem Management process
• Problem Manager – responsible for the day-to-day operation of the Problem Management process
Challenges
• Focusing too much on technology
• Failing to incorporate proactive Problem Management
• Weak interfaces between key processes
• Lack of adequate/quality data capture in Incident Management
• Failure to allocate staff time
• Failure to focus on the right Problems
Keys for Success
• Obtain Senior IT Leadership support
• Establish a vision and purpose
• Identify/quantify ROI
• Have a clearly defined and documented process
• Define and fill roles with the right personnel
• Have an effective Incident Management process
• Roll out Problem Management to a pilot team and then to the rest of the organization
• Choose the right support tools
• Have effective KPI and management reporting
Problem Management – Benefits
• A reduction in incident volume
• Improved First Call Resolution
• Shorter resolution times
• Higher availability and reliability of IT services
• Higher productivity of the users and IT staff
• Increased customer satisfaction with IT
We Need Problem Management!!
“Insanity… doing things the way we've always done them, yet expecting different results.”
Einstein/Deming