

SESSION 306 Wednesday, November 1, 3:00pm - 4:00pm

Track: Improving Service Management

Problem Management: A Practical Guide

Buff Scott III Principal Consultant, Propoint Solutions, Inc. [email protected]

Session Description

Problem management is one of two ITSM processes referred to as service resolution and restoration processes; the other is incident management. While incident management is focused on restoring normal IT service operation as quickly as possible, problem management focuses on determining the root cause of one or more incidents, identifying temporary workarounds, and applying permanent fixes so that incidents (and service disruptions) don’t happen again. Join this session and take home a practical approach to problem management for your organization. Let’s transition from "firefighting" to "fire prevention"!

Speaker Background

Buff Scott III has more than 35 years of experience in the IT industry. He's a versatile leader with extensive management experience, and he's an accredited ITIL v3 Expert, ITIL Trainer, and HDI Faculty member. Buff also holds the Certified Information Systems Auditor (CISA) certification and is an International Best Practice co-author of "Problem Management: A Practical Guide". He has presented at numerous local and national IT service management conferences and forums.


Welcome!

Buff Scott [email protected]

• ITIL Expert

• EXIN Accredited ITIL Trainer

• International Best Practice Author – “Problem Management: A Practical Guide”

• Certified Information Systems Auditor

• TIPA Lead Process Assessor

Service Restoration

Duct Tape Isn’t The Answer For Everything!!

Incident Management

• More than restoring services

• Characteristic of high-performing IT organizations

• Eliminate recurring incidents

• Prevent incidents from occurring

• Minimize the impact of incidents and problems when they cannot be prevented

• Logs data used for trending by Problem Management

• Categorizes incidents, which aids appropriate incident and problem assignment

• Prioritizes incidents, which triggers problem prioritization

• Links incidents to problems

Incident – An unplanned interruption to the standard operation of a service, or a reduction in the quality of that service

Problem – The underlying cause of one or more incidents

Problem Management vs. Incident Management
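The relationship between the two definitions (one problem underlying one or more incidents, with incidents linked to problems) can be sketched as a small data model. This is only an illustration, not a schema from any ITSM tool: the class names, fields, and the `link_incident` method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """An unplanned interruption to, or reduction in quality of, a service."""
    incident_id: str
    summary: str
    category: str
    priority: int                      # assumption: 1 = highest priority
    problem_id: Optional[str] = None   # filled in once linked to a problem


@dataclass
class Problem:
    """The underlying cause of one or more incidents."""
    problem_id: str
    summary: str
    root_cause: Optional[str] = None   # unknown until diagnosis completes
    workaround: Optional[str] = None
    incident_ids: List[str] = field(default_factory=list)

    def link_incident(self, incident: Incident) -> None:
        """Link an incident to this problem; repeated links are ignored."""
        if incident.incident_id not in self.incident_ids:
            self.incident_ids.append(incident.incident_id)
        incident.problem_id = self.problem_id
```

Keeping the link on both records lets Incident Management find a documented workaround from the incident, and lets Problem Management count the incidents behind each problem.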

Problem Management – Permanent Solutions

[Diagram showing Problem Management and Change Management with the following elements: Incident Database, Problem DB, KEDB or KB, Matching, Problem Record, Root Cause, Workaround, Known Error, CI at fault, RFC, Change / Release]

Problem Management Scope

Reactive Problem Management is focused on solving problems in response to one or more incidents as they occur

Proactive Problem Management is focused on identifying and solving problems and known errors that might otherwise be missed, thereby preventing future incidents

Problem Management Activities

• Detection and categorization

o Those activities focused on identifying, logging, and classifying problems

• Investigation and diagnosis

o Those activities focused on identifying root cause and transforming problems into known errors

• Resolution and recovery

o Those activities focused on identifying, approving, applying, and validating permanent fixes to problems and known errors

• Closure

o Those activities focused on closing problems, known errors, and related incidents with updated and reusable information
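The four activities can be read as stages that a problem record moves through. The sketch below models them as a simple state machine. The strictly linear order and the names `Stage`, `NEXT_STAGE`, and `advance` are assumptions made for illustration; a real process may loop back, for example when a fix fails verification.

```python
from enum import Enum


class Stage(Enum):
    DETECTION = "Detection and categorization"
    INVESTIGATION = "Investigation and diagnosis"
    RESOLUTION = "Resolution and recovery"
    CLOSURE = "Closure"


# Allowed forward transitions (assumed to be strictly linear here).
NEXT_STAGE = {
    Stage.DETECTION: Stage.INVESTIGATION,
    Stage.INVESTIGATION: Stage.RESOLUTION,
    Stage.RESOLUTION: Stage.CLOSURE,
}


def advance(stage: Stage) -> Stage:
    """Return the stage that follows `stage`; raise if it is the last one."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.value} is the final stage")
    return NEXT_STAGE[stage]
```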

Reactive Problem Management

Triggers for opening a problem record

• There is an incident for which the root cause is not known

• Analysis of an incident by a Support Group reveals a potential underlying problem

• Event and alerting tools automatically create an incident record due to fault detection. This may reveal the need for a problem record.

• A major incident was declared

Proactive Problem Management

• Analysis of incidents over differing time periods reveals a recurring trend, indicating an underlying problem might exist

• Analysis of the IT infrastructure by Support Groups identifies a potential problem

• Analysis results from data mining of the knowledgebase

• Reports generated from application or system software
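The first proactive trigger, a recurring trend across incidents, lends itself to a simple check. The sketch below counts incidents per category inside a time window and flags categories that reach a threshold. The function name, the 30-day window, and the threshold of 3 are illustrative assumptions, not values from the session.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, List, Tuple


def recurring_categories(
    incidents: Iterable[Tuple[date, str]],
    today: date,
    window_days: int = 30,   # assumed look-back window
    threshold: int = 3,      # assumed minimum count that signals a trend
) -> List[str]:
    """Return categories with at least `threshold` incidents in the window.

    Each incident is an (opened_on, category) pair.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        category for opened_on, category in incidents
        if cutoff <= opened_on <= today
    )
    return sorted(c for c, n in counts.items() if n >= threshold)
```

Running the same check with several window lengths (for example 7, 30, and 90 days) corresponds to analyzing incidents "over differing time periods".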

Investigation and Diagnosis

Define Problem

Document any Workaround

Collect Data

Analyze Data

Perform Root Cause Analysis

Document Conclusion

• Determine what happened

• Determine why it happened (understand causal factors)

• Identify and document a workaround

• Determine the root cause

Ishikawa Diagrams

Kepner & Tregoe

Pareto Analysis

Fault Tree Analysis

Investigation and Diagnosis

The four major classifications of root causes:

• Physical causes – components failed

• System errors – software failed

• Human causes – people did something wrong or failed to do something they should have

• Organizational causes – a process, policy, or procedure is in error

Investigation and Diagnosis

Common Root Cause Analysis (RCA) Techniques:

• Brainstorming

• Five “Whys”

• Chronological Analysis

• Ishikawa Diagrams

• Pareto Analysis

• Kepner-Tregoe

• Fault Tree Analysis

Root Cause Analysis

• Focus initially on major incidents or priority 1 incidents

• Identify RCA team based on customer, service or category

• Start with a timeline (chronological analysis)

• Brainstorm and identify all possible causes

• Use Pareto Analysis when data is available to identify the most likely causes

• Post your work for others to see/use
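When cause data is available, Pareto Analysis ranks causes by frequency and keeps the few that account for most occurrences. The sketch below is a minimal version; the `pareto` function, the 80% cutoff, and the sample cause labels are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto(causes: Iterable[str], cutoff: float = 0.8) -> List[Tuple[str, int, float]]:
    """Return the most frequent causes that together reach `cutoff`.

    Each result row is (cause, count, cumulative_share_of_all_occurrences).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    rows: List[Tuple[str, int, float]] = []
    running = 0
    for cause, n in counts.most_common():   # highest count first
        running += n
        rows.append((cause, n, running / total))
        if running / total >= cutoff:
            break
    return rows


# Hypothetical cause labels recorded against ten incidents.
sample = ["config"] * 5 + ["disk"] * 3 + ["network"] + ["power"]
for cause, count, share in pareto(sample):
    print(f"{cause}: {count} incidents, cumulative {share:.0%}")
```

In this sample, two of the four causes cover 80% of the incidents, so the RCA team would start with those two.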

Resolution & Recovery

• Research and identify possible solutions

• Choose a solution

• Test the proposed solution

• Submit a Request For Change (RFC) to Change Management for approval to implement the proposed solution

• Implement the proposed solution

• Verify the solution corrected the error

• Execute problem prevention activities

• Update the KEDB or knowledge base with resolution information

Closure

• Verify that the Problem and Known Error records are updated, correct and complete

• Close the Problem or Known Error records when the change has been implemented and the solution verified (there are no new Incidents related to the Problem)

• Update the status of related open Incidents at the time of Problem and Known Error record closure

• Conduct a post-implementation review for capturing lessons learned to be applied to future Problems
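The closure conditions above can be expressed as a single check before a record is closed. This is a sketch under stated assumptions: the `ProblemStatus` fields and the `ready_to_close` function are hypothetical names, and how "no new Incidents" is measured (for example, over what observation period) is left to the organization.

```python
from dataclasses import dataclass


@dataclass
class ProblemStatus:
    change_implemented: bool      # the approved change has been deployed
    solution_verified: bool       # the fix was confirmed to correct the error
    new_related_incidents: int    # incidents linked since the fix went in
    record_complete: bool         # record is updated, correct and complete


def ready_to_close(status: ProblemStatus) -> bool:
    """Return True only when every closure condition is met."""
    return (
        status.change_implemented
        and status.solution_verified
        and status.new_related_incidents == 0
        and status.record_complete
    )
```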

Problem Management

Inputs: Incident records, CMDB info, Knowledgebase (KEDB), reports, monitoring tool logs, release issues, risk analysis output

Outputs: Workarounds, Known Errors, RFCs, Permanent fixes, Closed Problems & Incidents, Reports

Major Activities: Detection and Classification, Investigation & Diagnosis, Resolution & Recovery, Closure

Activity steps: Identification and Recording, Classification and Resource Allocation, Investigate and Diagnose, Solution Identification, Solution Implementation, Problem and Error Closure

Tasks:

• Determine if an existing KE or Problem exists

• Assign Problem

• Open a Problem record

• Identify required staffing skill set

• Determine & validate workarounds

• Update Status and Priority

• Investigate Problem

• Match Incidents and link to KE or Problem record

• Categorize Problem

• Find Root Cause and identify CI at fault

• Document workaround in Problem record

• Hold Post Implementation Review as Needed

• Determine action to resolve Known Error record

• Test solution

• Update Known Error or Problem Record with solution

• Submit RFC, if needed

• Implement permanent fix

• Develop permanent fix

• Close Known Error or Problem record

• Verify error was corrected

• Notify Service Desk to close Incidents

• Validate records are complete

Roles and Responsibilities

The primary roles involved in Problem Management are:

• Problem Analyst – members of Support Groups who are assigned Problems

• Process Owner – owns and maintains the Problem Management process

• Problem Manager – responsible for the day-to-day operation of the Problem Management process

Challenges

• Focusing too much on technology

• Failing to incorporate proactive Problem Management

• Weak interfaces between key processes

• Lack of adequate/quality data capture in Incident Management

• Failure to allocate staff time

• Failure to focus on the right Problems

Keys for Success

• Obtain Senior IT Leadership support

• Establish a vision and purpose

• Identify/quantify ROI

• Have a clearly defined and documented process

• Define and fill roles with the right personnel

• Have an effective Incident Management process

• Roll out Problem Management to a pilot team and then to the rest of the organization

• Choose the right support tools

• Have effective KPI and management reporting

Problem Management – Benefits

• A reduction in incident volume

• Improved First Call Resolution

• Shorter resolution times

• Higher availability and reliability of IT services

• Higher productivity of the users and IT staff

• Increased customer satisfaction with IT

We Need Problem Management!!

“Insanity… doing things the way we've always done them, yet expecting different results.”

Einstein/Deming