1. Thinking problem management! Perceptions
Ronald Bartels
2. The problem between IT and business
- The business problem that must be resolved is the containment
of loss that results from IT outages.
- Containment, not reduction, as business is never static: dramatic changes and company expansion can cause a proportional increase in outages.
- Business hears that outages are going to be reduced, complains when this does not happen, and does not discount expansions or acquisitions.
- Business needs to be sold on the fact that if the number of employees doubled and outages increased by only 10%, then that is a good thing.
3. The problem with analogies
- An analogy is like a bucket of water with a hole in it: you can only carry it so far.
4. (image slide)
5. Right eye + Left eye → Stereo → Parallax → Depth perception → Avoid collisions, Accuracy
6. Operations: Process (ITIL), Visibility, Metrics (Dashboards) → Trust: Secure, Available, Cost effective, Efficient and effective
7. What the blind see when using binoculars
8. Questions?
- How do you describe depth to a blind person in terms of
vision?
- How do you describe trust to business in terms of
technology?
9. How do you describe depth to a blind person in terms of
vision?
10. How do you describe trust to business in terms of
technology?
- Non-technology terms (people)
11. Steve Prentice, Gartner
- How computers look is as important as how they perform
12. Operations: ITIL
13. Process in the context of service management?
- People: properly aligned with the organisation and business processes
- Process: transformation to create real value
- Technology: to enable efficient and effective support processes across the enterprise
14. Visibility: Metrics
15. How do you eat an elephant? One bite at a time! The first bite = problem management, with a large amount of chewing on major incidents. A full discussion on IT metrics is a large elephant (in actual fact it is 12 tons).
16. The elephant: Customer satisfaction
- Number of changes submitted
- Number of changes in process (i.e. the backlog)
- Number of changes rejected
- Number of changes implemented
- Number of emergency changes
- Number of unauthorized changes
- Number of changes that exceeded the allowed change window period
- Number of failed changes that did not have a back-out plan
- Loss associated with failed changes
- Number of changes implemented on schedule
- Number of SLAs breached due to a failed change
- Number of changes that failed during installation
- Number of changes that caused an incident
- Number of changes that caused a problem
- Number of inaccurate Configuration Items (CIs), where the production CI doesn't match the CI record
- Number of failed changes due to inaccurate CIs
- Number of incidents caused by inaccurate CIs
- Amount of unplanned work caused by inaccurate CIs
- Number of unused licenses or lack of licenses (important ROI calculation)
- Number of unauthorized CIs (no corresponding RFC)
1, 2 and 3 tons
- Number of releases that conformed to the company's Release Policy
- Number of releases implemented according to schedule
- Number of releases implemented late
- Number of unauthorized CIs in the Definitive Software Library (DSL)
- Number of releases that were not tested according to plan
- Number of emergency releases
17. The elephant: Customer satisfaction?
- Number of calls (phone, fax, email or portal) to the service desk
- Number of calls handled per agent
- Number of calls handled within SLA targets
- Number of calls handled that exceeded SLA targets
- Resolution rate during first contact
- Number of calls escalated due to timing
- Number of calls escalated due to skills required
- Average time the caller waits in queue
- Number of known errors resolved
- Number of RFCs raised by Problem Management
4, 5, 6 and 7 tons
- Number of incidents resolved within SLA targets for each level of priority
- Number of incidents escalated to each level of support
- Average time to resolve incidents by priority
- Number of incidents incorrectly recorded (priority, type, etc.)
- Number of incidents incorrectly escalated to the wrong second-line resource
- Number of services covered by SLAs
- Number of SLAs that do not have required Operating Level Agreements and/or Underpinning Contracts
- Number of SLA targets at risk
- Business impact of breaches
- Number of service complaints
- Number of service reviews conducted
- Number of service reviews outstanding
- Number of service improvement plans (SIPs) opened
- Number of open tasks from SIPs
18. The elephant: Customer satisfaction?
- Service availability expressed using an agreed-upon measure: Availability = Uptime / Time Possible
- Mean Time To Repair (MTTR)
- Mean Time Between Service Incidents (MTBSI)
- Business impact of outages
- Number of services where availability targets were improved, maintained or decreased
- Number of services with a continuity plan
- Number of services without a continuity plan
- Number of continuity plans tested
- Number of continuity plans not tested according to schedule (backlog)
- Number of open issues raised by testing
- Number of plans which are high risk
- Number of plans evaluated as ineffective
- Number of services with unknown capacity requirements
- Unplanned capacity purchases
- Accuracy of capacity plan
- Capacity purchases vs. budgeted amounts
- Number of CIs with performance monitoring
- Actual expenses relative to budget
- Number of services with known costs
- Number of services reviewed per schedule
- Number of services charged by usage
- Amount of IT costs absorbed
- Number of services with a recovery model in testing
- Number of services with a recovery model implemented
8, 9, 10, 11 and 12 tons
- Number of security incidents opened by severity
- Number of security incidents closed by severity
- Number of services that have had security reviews
- Number of security reviews outstanding
- Number of risks identified
- Number of risks mitigated to an acceptable level
19. Problem Management
- To minimize the adverse impacts of incidents and to prevent
recurrence of incidents. Problem Management seeks to get to the
root cause and initiate action to remove the error.
- A problem is the unknown, underlying cause of one or more
incidents.
- A known error is when the root cause of a problem is known and
a temporary workaround or alternative has been identified.
20. Problem management activities
- Assistance with the handling of major incidents and providing quality control
- Problem identification and recording
- Problem investigation and diagnosis
- Error identification and recording
- Recording error resolution
- Monitoring resolution progress
- Proactive prevention of problems
- Providing information to the company
- Obtaining management information from problem data
- Completing major problem reviews
21. Why Problem Management?
- Becoming proactive instead of reactive (fire-fighting)
- Reward proactive problem management, not reactive fault fixing
- Generate honesty, integrity and transparency
- Delivers real business benefit
- Delivers bottom-line benefit from the productive use of IT systems
- Service failures are an avoidable cost: fixing things quickly doesn't really benefit the business; avoidance is worth more
- Requires IT customer engagement
- IT customers should assist in assigning priorities and estimating costs
- Workarounds change business processes
- Creating proactive views results in discussions with IT customers
22. Problem management dashboard
- Service catalogue
- Risk, outage, classification (ROC) area map
- Expanded incident lifecycle
- Time analysis
- Last 10 major incidents
- Heat map
- Top 10 problems
- Problem/Incident breakdown
- Major incident skyline
23. Major Incidents
- What constitutes a Major Incident?
- An incident is any event that is not part of the standard operation of a service and that causes an interruption or a reduction in the quality of that service. Incidents are recorded in a standardized system which is used for documenting and tracking outages and disruptions. A Major Incident is an unplanned or temporary interruption of service with severe negative consequences. Examples are outages involving core infrastructure equipment/services that affect a significant customer base, such as the isolation of a company site. Any equipment or service outage that does not meet the criteria necessary to qualify as a Major Incident is by default a Minor Incident. Major incident reports are escalated to the problem manager for quality assurance.
There is a close relationship between the major incident process and problem management.
24. Major Incident template
25. Incident consequence, grading: taking the blah, blah, blah out of IT.
26. Major Incident process (timeline diagram: Incident → Detected → Diagnosed → Workaround → Repair → Recover → Restore → Resolution, with escalations to problem management, a hot ticket and hot line, service desk notification/feedback to IT customers, ROC analysis, IUM, and plan/declaration as opportunistic, known or potential major)
27. Expanded Incident lifecycle
Crucial times to record for problem management.
28. IUM (Incident User Metric)
- What is the opportunity cost to the company of 1 minute's outage, based on the effect on productivity? (Or, put another way, what is the total salary bill of the company for 1 minute?)
- What was the length of the outage?
- What percentage of the IT customer population was impacted?
- Is there a lesser multiplier? (Liability, scrutiny by management, internal process, company's image)
- Length of outage * population impacted * opportunity cost * (multiplier) = INCIDENT USER METRIC
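The formula above can be sketched as a small function. The figures in the example (outage length, share of staff affected, per-minute salary bill, multiplier) are hypothetical, chosen only to illustrate the calculation:

```python
def incident_user_metric(outage_minutes, population_impacted,
                         opportunity_cost_per_min, multiplier=1.0):
    """Length of outage * population impacted * opportunity cost * (multiplier)."""
    return outage_minutes * population_impacted * opportunity_cost_per_min * multiplier

# Hypothetical figures: a 45-minute outage affecting 30% of staff,
# with a company-wide salary bill of $500 per minute and no extra multiplier.
print(incident_user_metric(45, 0.30, 500))  # 6750.0
```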
How big was it really?
29. IUM example
Large incidents are easily visualized.
30. How does IUM help?
- Most incidents that affect a significant number of IT customers are potential major incidents. What constitutes a major incident and what does not?
- The key is in the IUM. After a large enough sample pool has been built (> 10 incidents), the average is calculated.
- A minor incident is one where the IUM is more than 20% below the norm.
- A major incident is one where the IUM is more than 20% above the norm.
- A normal incident is one where the IUM is within 20% of the norm.
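A minimal sketch of this classification rule, reading "less/greater than 20% of the norm" as 20% below or above the running average:

```python
def classify_incident(ium, norm):
    """Classify an incident by comparing its IUM to the norm
    built from a sample pool of more than 10 incidents."""
    if ium < 0.8 * norm:
        return "Minor"
    if ium > 1.2 * norm:
        return "Major"
    return "Normal"

print(classify_incident(5000, 10000))   # Minor
print(classify_incident(13000, 10000))  # Major
print(classify_incident(11000, 10000))  # Normal
```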
Minor, Normal and Major!
31. ROC Analysis (Risk, Outage, Classification)
- BIA Lite (lightweight business impact analysis)
- SOA Lite (lightweight service outage analysis)
- CRAMM Lite (lightweight risk management)
Full-blown process analysis is time consuming, so a variant is used that is quick and easy.
32. CRAMM Lite
- The asset, process or resources involved in the major incident are measured from a risk perspective. Three areas are assessed. Each area has a maximum score of 4, and the risk is the total score of all areas represented as a percentage.
- CIA (confidentiality, integrity and availability) are scored.
- Loss (C), error (I) and failure (A) are scored.
- Countermeasures already in place, and those that will be implemented in the future, are scored.
The R in ROC analysis
33. Risk example
- The impact is rated as 4 (Critical): Confidentiality = Secure, Integrity = Very high, Availability = Mandatory
- The impact is rated as 4: high loss probability, high error probability, high failure probability
- Countermeasures are rated as 2: service provider due diligence.
- The score is thus 10 out of a maximum of 12 = 83%
34. SOA Lite
- An outage analysis is conducted of the service impacted. Two areas are assessed. Each area has a maximum score of 4, and the service outage is the total score of all areas represented as a percentage.
- The measurement is based on elapsed time.
- Determined by financial means or business perceptions.
The O in ROC analysis
35. Outage example
- The period is rated as 3 (Major): app, server or link (network or voice) unavailable for greater than 1 hour, or degraded for greater than 4 hours
- The consequence is rated as 2 (Moderate): financial loss which impacts the profitability of the business unit (greater than $100k), or embarrassment, or reported to a regulator, or hospitalization
- The score is thus 5 out of a maximum of 8 = 63%
36. BIA Lite
- The resultant impact on the company is measured to determine the customer perspective. Five areas are assessed. Each area has a maximum score of 4, and the classification is the total score of all areas represented as a percentage.
- Percentage of customers affected
- Internal and external negative consequences for the company
The C in ROC analysis
37. Classification example
- The scope is rated as 2: less than 25% of customers affected
- The credibility is rated as 4: areas outside the company will be affected negatively
- The operations is rated as 2: interferes with normal completion of work
- The urgency is rated as 3: caused by unscheduled change or maintenance
- The prioritization is rated as 3 (High): technicians respond immediately, assess the situation, and may interrupt other staff working low- or medium-priority jobs for assistance
- The score is thus 14 out of a maximum of 20 = 70%
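All three ROC components follow the same arithmetic: total the area scores (maximum 4 each) and express the total as a percentage of the maximum possible. A sketch, fed with the three worked examples above (half-up rounding is assumed, which matches the percentages quoted in the examples):

```python
def roc_component(area_scores, max_per_area=4):
    """Total of the area scores as a whole-number percentage
    of the maximum possible score (half-up rounding)."""
    return int(100 * sum(area_scores) / (max_per_area * len(area_scores)) + 0.5)

print(roc_component([4, 4, 2]))        # Risk (CRAMM Lite): 10/12 -> 83
print(roc_component([3, 2]))           # Outage (SOA Lite): 5/8 -> 63
print(roc_component([2, 4, 2, 3, 3]))  # Classification (BIA Lite): 14/20 -> 70
```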
38. How does ROC help?
- After a large enough sample pool has been built, the averages are calculated.
- Let's assume the averages are: Risk = 54%, Outage = 49% and Classification = 70%.
- A new incident happens where the Risk = 68%, Outage = 40% and the Classification = 69%.
- The following statement about the incident can be made: the incident affected the company the same as usual; the outage was less than normal; the risk was greater than average.
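The comparison against the running averages can be sketched as below. The ±5 percentage-point tolerance for "the same as usual" is an assumption for illustration, not a figure from the talk:

```python
def compare_to_norm(value, norm, tolerance=5):
    """Describe a ROC component relative to its running average
    (tolerance is in percentage points and is an assumed value)."""
    if value > norm + tolerance:
        return "greater than average"
    if value < norm - tolerance:
        return "less than normal"
    return "the same as usual"

norms = {"Risk": 54, "Outage": 49, "Classification": 70}
incident = {"Risk": 68, "Outage": 40, "Classification": 69}
for component, norm in norms.items():
    print(component, "was", compare_to_norm(incident[component], norm))
```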
The meaning of ROC
39. Grading the resources used during a major incident
- Identification and business impact: have the resources correctly identified the major incident and described what happened in the correct level of detail? Has the correct impacted service been identified from the service catalogue? Was the business impact obtained or measured?
- Conditions: what were the business, IT or environmental conditions present during the incident, and did the resources describe these to a suitable level of detail?
- Expanded incident lifecycle: are all the times in the expanded incident lifecycle recorded, and are they realistic? Were these recorded in the incident reference at the service desk?
- Resolution/workaround: how suitable was the resolution, and was a workaround implemented to reduce the time the service was unavailable?
- Classification: have the resources correctly classified the impact to the company, and was the incident handled with the correct level of prioritization?
- Outage: have the resources recorded and classified the outage times correctly?
- Risk: has a suitable risk assessment of the service, asset and process been conducted?
- Escalations/communications: did the resources escalate the incident, and was communication during the process suitable?
The maximum possible score is 32, and the grading is calculated by totalling up the scores from the 8 different areas and representing the total as a percentage of the maximum.
40. Vendor management: scoring a vendor's performance
- Assessment of service and support: responds to support calls on first contact, resolves issues during support calls, reports defects before the customer discovers them, provides patches and fixes that work, and provides accurate documentation.
- Assessment of account team: provides clear points of contact for sales, support and service related issues, responds to questions in a timely fashion, understands the company's specific needs and requests.
- Assessment of technical quality and innovation: delivers products that perform as expected, delivers maintenance services that meet requirements, communicates information regarding new product timeframes, and provides unique technologies, features and capabilities.
- Assessment of procurement, billing and reconciliation: meets process deadlines, meets requested customization to product or service, estimates statements of work accurately, delivers accurate invoices, and resolves invoice discrepancies quickly.
The vendor score consists of four sections, each with a maximum score of 4. The maximum possible score is thus 16, and the vendor score is represented as a percentage of this maximum possible score. Each section is scored as follows: 0) Unacceptable, 1) Low, 2) Average, 3) High and 4) Exceptional.
41. Vendor evaluation example
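The vendor scoring described above can be sketched as follows; the four section values in the example are hypothetical:

```python
# 0-4 scale from the vendor management slide.
SCALE = {0: "Unacceptable", 1: "Low", 2: "Average", 3: "High", 4: "Exceptional"}

def vendor_score(section_scores):
    """Four sections, each scored 0-4; the result is a percentage
    of the maximum possible score of 16."""
    assert len(section_scores) == 4 and all(0 <= s <= 4 for s in section_scores)
    return round(100 * sum(section_scores) / 16)

# Hypothetical vendor: service/support = 3, account team = 2,
# technical quality/innovation = 4, procurement/billing = 2.
print(vendor_score([3, 2, 4, 2]))  # 69
```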
Ability to evaluate vendors against each other and over time periods.
42. Evaluating second-line escalations/interactions from the service desk
- The following evaluation provides a means of assessing the performance of second-line escalations or interactions from the service desk.
- The purpose is to provide a "no broken windows" environment, with detail from the bottom up.
- Each day a random sample of 5 calls is graded by the Service Desk manager. Each item receives 1 point, up to a maximum of 10 points.
- The categories graded are classification, description, actions and times.
43. Example second line evaluation
- Trending service desk detail
44. Example service desk interaction
- Trending service desk detail
45. Problem management template
46. Infrastructure team leader dashboard
- Weekly dashboard to monitor team progress