
1. Thinking problem management! Perceptions [email_address] Ronald Bartels

2. The problem between IT and business

  • The business problem that must be resolved is the containment of loss resulting from IT outages.
  • Containment, not reduction: business is never static, and dramatic change or company expansion can cause a proportional increase in outages.
  • Business hears that outages are going to be reduced, complains when this does not happen, and does not discount the effect of expansions or acquisitions.
  • Business needs to be sold on the fact that if the number of employees doubled and outages increased by only 10%, then that is a good outcome.

3. The problem with analogies

  • An analogy is like a bucket of water with a hole in it: you can only carry it so far.

4. (image slide)

5. Depth perception

  • Right eye + left eye = stereo; parallax gives depth perception
  • Avoids collisions; accuracy

6. Operations

  • Process (ITIL)
  • Visibility: metrics (dashboards)
  • Trust: secure, available, cost effective, efficient and effective

7. What the blind see when using binoculars

8. Questions?

  • How do you describe depth to a blind person in terms of vision?
  • How do you describe trust to business in terms of technology?

9. How do you describe depth to a blind person in terms of vision?

  • Non-vision senses
    • Taste
    • Smell
    • Hearing
    • Touch

10. How do you describe trust to business in terms of technology?

  • Non-technology terms (people)
    • Intelligence
    • Emotion
    • Physical
    • Spiritual

11. Steve Prentice, Gartner

  • How computers look is as important as how they perform

12. Operations: ITIL

13. Process in the context of service management?

  • People: properly aligned with the organisation and business processes
  • Process: transformation to create real value
  • Technology: enabling efficient and effective support processes across the enterprise
  • Together these deliver service management

14. Visibility: metrics

15. How do you eat an elephant? One bite at a time!

  • First bite = problem management, with a large amount of chewing on major incidents
  • A full discussion on IT metrics is a large elephant (in actual fact it is 12 tons)

16. The elephant: customer satisfaction

  • Change management
    • Amount of changes submitted
    • Amount of changes in process (meaning the backlog)
    • Amount of changes rejected
    • Amount of changes implemented
    • Amount of emergency changes
    • Amount of unauthorized changes
    • Amount of changes that exceeded the allowed change window period
    • Amount of failed changes that did not have a back-out plan
    • Loss associated with failed changes
    • Amount of changes implemented on schedule
    • Amount of SLAs breached due to a failed change
    • Amount of changes that failed during installation
    • Amount of changes that caused an incident
    • Amount of changes that caused a problem
  • Configuration management
    • Amount of inaccurate Configuration Items (CIs), where the production CI doesn't match the CI record
    • Amount of failed changes due to inaccurate CIs
    • Amount of incidents caused by inaccurate CIs
    • Amount of unplanned work caused by inaccurate CIs
    • Amount of unused licenses or lack of licenses (important ROI calculation)
    • Amount of unauthorized CIs (no corresponding RFC)

1, 2 and 3 tons

  • Release management
    • Amount of releases that conformed to the company's Release Policy
    • Amount of releases implemented according to schedule
    • Amount of releases implemented late
    • Amount of unauthorized CIs in the Definitive Software Library (DSL)
    • Amount of releases that were not tested according to plan
    • Amount of emergency releases

17. The elephant: customer satisfaction?

  • Service desk
    • Amount of calls (phone, fax, email or portal) to the service desk
    • Amount of calls handled per agent
    • Amount of work requests
    • Amount of incidents
    • Amount of calls handled within SLA targets
    • Amount of calls handled that exceeded SLA targets
    • Resolution rate during first contact
    • Amount of calls escalated due to timing
    • Amount of calls escalated due to skills required
    • Average time the caller waits in queue
  • Problem management
    • Amount of problems
    • Amount of known errors
    • Amount of known errors resolved
    • Amount of RFCs raised by Problem Management

4, 5, 6 and 7 tons

  • Incident management
    • Amount of incidents
    • Amount of incidents resolved within SLA targets for each level of priority
    • Amount of incidents escalated to each level of support
    • Average time to resolve incidents by priority
    • Amount of incidents incorrectly recorded (priority, type, etc.)
    • Amount of incidents incorrectly escalated to the wrong 2nd-line resource
  • Service level management
    • Amount of services covered by SLAs
    • Amount of SLAs that do not have required Operating Level Agreements and/or Underpinning Contracts
    • Amount of SLA breaches
    • Amount of SLA targets at risk
    • Business impact of breaches
    • Amount of service complaints
    • Amount of service reviews conducted
    • Amount of service reviews outstanding
    • Amount of service improvement plans (SIPs) opened
    • Amount of open tasks from SIPs
    • Amount of SIPs closed

18. The elephant: customer satisfaction?

  • Availability management
    • Service availability expressed using an agreed-upon measure: Availability = Uptime / Time possible (see the sketch after this list)
    • Mean time to detect
    • Mean time to repair (MTTR)
    • Mean time between service incidents (MTBSI)
    • Business impact of outages
    • Amount of services where availability targets were improved, maintained or decreased
  • Service continuity
    • Amount of services with a continuity plan
    • Amount of services without a continuity plan
    • Amount of continuity plans tested
    • Amount of continuity plans not tested according to schedule (backlog)
    • Amount of open issues raised by testing
    • Amount of plans which are high risk
    • Amount of plans evaluated as ineffective
  • Capacity management
    • Amount of services with unknown capacity requirements
    • Unplanned capacity purchases
    • Accuracy of the capacity plan (capacity purchases vs. budgeted amounts)
    • Amount of CIs with performance monitoring
  • Financial management
    • Actual expenses relative to budget
      • Amount of services with a known cost
      • Amount of services reviewed per schedule
      • Amount of services charged by usage
    • Charge back
      • Amount of IT costs absorbed
      • Profitability or surplus
      • Amount of services with a recovery model in testing
      • Amount of services with a recovery model implemented
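The availability measures in this list reduce to simple arithmetic. A minimal sketch, with hypothetical uptime figures:

```python
# Minimal sketch of the availability arithmetic above; all figures are hypothetical.
time_possible_h = 24.0 * 30               # a 30-day measurement window, in hours
outage_durations_h = [0.5, 2.0, 1.5]      # durations of three hypothetical incidents

downtime_h = sum(outage_durations_h)
uptime_h = time_possible_h - downtime_h

availability = uptime_h / time_possible_h               # Availability = Uptime / Time possible
mttr_h = downtime_h / len(outage_durations_h)           # mean time to repair
mtbsi_h = time_possible_h / len(outage_durations_h)     # mean time between service incidents

print(f"Availability: {availability:.2%}")              # 99.44%
print(f"MTTR: {mttr_h:.2f} h, MTBSI: {mtbsi_h:.0f} h")  # 1.33 h, 240 h
```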

8, 9, 10, 11 and 12 tons

  • Security management
    • Amount of security incidents opened by severity
    • Amount of security incidents closed by severity
    • Amount of services that have had security reviews
    • Amount of security reviews outstanding
    • Amount of risks identified
    • Amount of risks mitigated to an acceptable level

19. Problem Management

  • Goal
    • To minimize the adverse impact of incidents and to prevent their recurrence. Problem Management seeks to get to the root cause and initiates action to remove the error.
  • Definition
    • A problem is the unknown, underlying cause of one or more incidents.
    • A known error is when the root cause of a problem is known and a temporary workaround or alternative has been identified.

20. Problem management activities

  • Assistance with the handling of major incidents and providing quality control
  • Problem control
    • Problem identification and recording
    • Problem classification
    • Problem investigation and diagnosis
  • Error control
    • Error identification and recording
    • Error assessment
    • Recording error resolution
    • Error closure
    • Monitoring resolution progress
  • Proactive prevention of problems
    • Trend analysis
    • Targeting support action
    • Providing information to the company
  • Obtaining management information from problem data
  • Completing major problem reviews

21. Why Problem Management?

  • Becoming proactive instead of reactive (fire-fighting)
    • Management support
    • Reward proactive problem management, not reactive fault fixing
    • Generate honesty, integrity and transparency
  • Delivers real business benefit
    • Delivers bottom-line benefit from the productive use of IT systems
    • Service failures are an avoidable cost: fixing things quickly doesn't really benefit the business; avoidance is worth more
  • Requires IT customer engagement
    • IT customers should assist in assigning priorities and estimating costs
    • Work-arounds change business processes
    • Creating proactive views results in discussions with IT customers

22. Problem management dashboard

  • Service catalogue
  • Risk, outage, classification (ROC) area map
  • Expanded incident lifecycle
  • Time analysis
  • Last 10 major incidents
  • Heat map
  • Top 10 problems
  • Problem/incident breakdown
  • Major incident skyline

23. Major Incidents

  • What constitutes a Major incident?
  • An incident is any event that is not part of the standard operation of a service and that causes an interruption or a reduction in the quality of that service. Incidents are recorded in a standardized system which is used for documenting and tracking outages and disruptions. A Major Incident is an unplanned or temporary interruption of service with severe negative consequences. An example is an outage involving core infrastructure equipment or services that affects a significant customer base, such as the isolation of a company site. Any equipment or service outage that does not meet the criteria necessary to qualify as a Major Incident is by default a Minor Incident. Major incident reports are escalated to the problem manager for quality assurance.

There is a close relationship between the major incident process and problem management.

24. Major Incident template

25. Incident consequence, grading

Taking the blah, blah, blah out of IT.

26. Major Incident process

(process diagram: an incident timeline recording Detected, Diagnosed, Repair, Recover, Restore, Resolution and Workaround times; escalations to problem management with a report, hot ticket and hot line handling, service desk notification/feedback to IT customers, ROC analysis, IUM, and plan/declaration of opportunistic, known, potential and major states back to normal operations)

27. Expanded Incident lifecycle

  • Detected
  • Diagnosed
  • Repair
  • Restore
  • Recover
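A minimal sketch of the durations these lifecycle timestamps yield; the times and field names here are hypothetical:

```python
# Minimal sketch: durations derived from the expanded incident lifecycle
# timestamps; all values and field names are hypothetical.
from datetime import datetime

FMT = "%d/%m %H:%M"
t = {name: datetime.strptime(stamp, FMT) for name, stamp in [
    ("incident",  "12/03 09:00"),
    ("detected",  "12/03 09:05"),
    ("diagnosed", "12/03 09:40"),
    ("repair",    "12/03 10:10"),
    ("restore",   "12/03 10:25"),
    ("recover",   "12/03 10:30"),
]}

print("time to detect: ", t["detected"] - t["incident"])  # 0:05:00
print("time to restore:", t["restore"] - t["incident"])   # 1:25:00
```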

Crucial times to record for problem management.

28. IUM (Incident User Metric)

  • What is the opportunity cost to the company of 1 minute's outage, based on the effect on productivity? (Or, put another way, what is the total salary bill of the company for 1 minute?)
  • What was the length of the outage?
  • What percentage of the IT customer population was impacted?
  • Is there a lesser multiplier? (Liability, scrutiny by management, internal process, the company's image)
  • Length of outage * population impacted * opportunity cost * (multiplier) = INCIDENT USER METRIC
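A minimal sketch of the IUM arithmetic; all figures here are hypothetical (a 30-minute outage hitting 40% of staff at an opportunity cost of $500 per company-minute):

```python
# Minimal IUM sketch; all figures are hypothetical.
outage_minutes = 30          # length of the outage
population_impacted = 0.40   # fraction of the IT customer population affected
opportunity_cost = 500.0     # total salary bill of the company per minute, in dollars
multiplier = 1.0             # optional adjustment (liability, image, etc.)

ium = outage_minutes * population_impacted * opportunity_cost * multiplier
print(f"Incident User Metric: {ium:,.0f}")   # 6,000
```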

How big was it really?

29. IUM example

Large incidents are easily visualized.

30. How does IUM help?

  • Most incidents that affect a significant number of IT customers are potential major incidents. What constitutes a major incident and what does not?
  • The key is in the IUM. After a large enough sample pool has been built (more than 10 incidents), the average is calculated.
  • A minor incident is an incident where the IUM is more than 20% below the norm.
  • A major incident is an incident where the IUM is more than 20% above the norm.
  • A normal incident is an incident that is within 20% of the norm.
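A minimal sketch of that classification; the norm value is hypothetical:

```python
# Minimal sketch of the IUM-based classification; 'norm' is the average IUM
# over a sample pool of more than 10 incidents (hypothetical value here).
def classify(ium: float, norm: float) -> str:
    if ium < norm * 0.8:
        return "Minor"
    if ium > norm * 1.2:
        return "Major"
    return "Normal"

print(classify(6_000, norm=9_500))    # Minor: more than 20% below the norm
print(classify(12_000, norm=9_500))   # Major: more than 20% above the norm
```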

Minor, Normal and Major!

31. ROC Analysis (Risk, Outage, Classification)

  • Step 1: Classification
    • BIA Lite (lightweight business impact analysis)
  • Step 2: Outage
    • SOA Lite (lightweight service outage analysis)
  • Step 3: Risk
    • CRAMM Lite (lightweight risk management)

Full-blown process analysis is time consuming, so a variant is used that is quick and easy.

32. CRAMM Lite

  • The asset, process or resources involved in the major incident are measured from a risk perspective. Three areas are assessed. Each area has a maximum score of 4, and the risk is the total score of all areas represented as a percentage.
  • Impact
    • CIA (confidentiality, integrity and availability) is scored.
  • Vulnerability
    • Loss (C), error (I) and failure (A) are scored.
  • Counter measures
    • Countermeasures already in place, and those that will be implemented in the future, are scored.

The R in ROC analysis.

33. Risk example

  • The impact is rated as 4 (critical): confidentiality = secure, integrity = very high, availability = mandatory.
  • The vulnerability is rated as 4: high loss probability, high error probability, high failure probability.
  • Counter measures are rated as 2: service provider due diligence.
  • The score is thus 10 out of a maximum of 12 = 83%.
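All three ROC assessments share the same arithmetic: sum the area scores and express the total against the maximum possible. A minimal sketch (the helper name is mine), also covering the outage and classification examples that follow:

```python
# Minimal sketch: each ROC assessment sums its area scores (0-4 each) and
# expresses the total as a percentage of the maximum possible score.
def area_score_pct(scores, max_per_area=4):
    return 100.0 * sum(scores) / (max_per_area * len(scores))

# CRAMM Lite risk example above: impact 4, vulnerability 4, countermeasures 2.
print(f"Risk: {area_score_pct([4, 4, 2]):.1f}%")                   # 83.3%
# SOA Lite outage example below (period 3, consequence 2):
print(f"Outage: {area_score_pct([3, 2]):.1f}%")                    # 62.5%, rounded to 63%
# BIA Lite classification example below (scope 2, credibility 4,
# operations 2, urgency 3, prioritization 3):
print(f"Classification: {area_score_pct([2, 4, 2, 3, 3]):.1f}%")   # 70.0%
```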

34. SOA Lite

  • An outage analysis is conducted of the service impacted. Two areas are assessed. Each area has a maximum score of 4, and the service outage is the total score of all areas represented as a percentage.
  • Period
    • The measurement is based on elapsed time.
  • Consequence
    • Determined by financial means or business perceptions.

The O in ROC analysis.

35. Outage example

  • The period is rated as 3 (major): app, server or link (network or voice) unavailable for greater than 1 hour, or degraded for greater than 4 hours.
  • The consequence is rated as 2 (moderate): financial loss which impacts the profitability of the business unit (greater than $100k), or embarrassment, or being reported to the regulator, or hospitalization.
  • The score is thus 5 out of a maximum of 8 = 63%.

36. BIA Lite

  • The resultant impact on the company is measured to determine the customer perspective. Five areas are assessed. Each area has a maximum score of 4, and the classification is the total score of all areas represented as a percentage.
  • Scope
    • Percentage of customers affected
  • Credibility
    • Internal and external negative consequences in the company
  • Operations
    • Business interference
  • Urgency
    • Time planning
  • Prioritization
    • Resource reaction

The C in ROC analysis.

37. Classification example

  • The scope is rated as 2: less than 25% of customers affected.
  • The credibility is rated as 4: areas outside the company will be affected negatively.
  • The operations is rated as 2: interferes with normal completion of work.
  • The urgency is rated as 3: caused by unscheduled change or maintenance.
  • The prioritization is rated as 3 (high): technicians respond immediately, assess the situation, and may interrupt other staff working low or medium priority jobs for assistance.
  • The score is thus 14 out of a maximum of 20 = 70%.

38. How does ROC help?

  • After a large enough sample pool has been built, the averages are calculated.
  • Let's assume the averages are: Risk = 54%, Outage = 49% and Classification = 70%.
  • A new incident happens where the Risk = 68%, Outage = 40% and the Classification = 69%.
  • The following statement about the incident can then be made: the incident affected the company about the same as usual; the outage was less than normal; the risk was greater than average.
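A minimal sketch of that comparison; the tolerance band and wording are my own illustrative assumptions:

```python
# Minimal sketch comparing a new incident's ROC scores against the running
# averages (norms); the 5-point tolerance band is an illustrative assumption.
NORMS = {"risk": 54, "outage": 49, "classification": 70}

def compare(scores, tolerance=5.0):
    for area, norm in NORMS.items():
        delta = scores[area] - norm
        if abs(delta) <= tolerance:
            print(f"{area}: about the same as usual")
        elif delta > 0:
            print(f"{area}: greater than average ({delta:+.0f} points)")
        else:
            print(f"{area}: less than normal ({delta:+.0f} points)")

compare({"risk": 68, "outage": 40, "classification": 69})
# risk: greater than average (+14 points)
# outage: less than normal (-9 points)
# classification: about the same as usual
```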

The meaning of ROC.

39. Grading the resources used during a major incident

  • Identification and business impact: have the resources correctly identified the major incident and described, in the correct level of detail, what happened? Has the correct impacted service been identified from the service catalogue? Was the business impact obtained or measured?
  • Conditions: what were the business, IT or environmental conditions present during the incident, and did the resources describe these to a suitable level of detail?
  • Expanded incident lifecycle: are all the times in the expanded incident lifecycle recorded, and are they realistic? Were these recorded in the incident reference at the service desk?
  • Resolution/workaround: how suitable was the resolution, and was a workaround implemented to reduce the time the service was unavailable?
  • Classification: have the resources correctly classified the impact to the company, and was the incident handled with the correct level of prioritization?
  • Outage: have the resources recorded and classified the outage times correctly?
  • Risk: has a suitable risk assessment of the service, asset and process been conducted?
  • Escalations/communications: did the resources escalate the incident, and was communication during the process suitable?

The maximum possible score is 32, and the grading is calculated by totalling the scores from the 8 areas and representing the result as a percentage of the maximum.

40. Vendor management: scoring a vendor's performance

  • Support
    • Assessment of service and support: responds to support calls on first contact, resolves issues during support calls, reports defects before the customer discovers them, provides patches and fixes that work, and provides accurate documentation.
  • Delivery
    • Assessment of the account team: provides clear points of contact for sales, support and service related issues, responds to questions in a timely fashion, understands the company's specific needs and requests.
  • Operations
    • Assessment of technical quality and innovation: delivers products that perform as expected, delivers maintenance services that meet requirements, communicates information regarding new product timeframes, and provides unique technologies, features and capabilities.
  • Financial
    • Assessment of the procurement, billing and reconciliation processes: meets process deadlines, meets requested customization to a product or service, estimates statements of work accurately, delivers accurate invoices, and resolves invoice discrepancies quickly.
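The note below defines the four-section scoring scale; a minimal sketch of that calculation, with hypothetical section scores:

```python
# Minimal sketch of the vendor score described in the note below;
# the section scores used here are hypothetical.
SCALE = {0: "Unacceptable", 1: "Low", 2: "Average", 3: "High", 4: "Exceptional"}

sections = {"support": 3, "delivery": 2, "operations": 4, "financial": 3}

max_possible = 4 * len(sections)             # four sections, max 4 each = 16
vendor_score = 100.0 * sum(sections.values()) / max_possible
print(f"Vendor score: {vendor_score:.0f}%")  # 12/16 = 75%
for name, score in sections.items():
    print(f"  {name}: {score} ({SCALE[score]})")
```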

The vendor score consists of four sections, each with a maximum score of 4. The maximum possible score is thus 16, and the vendor score is represented as a percentage of this maximum. Each section is scored as follows: 0) Unacceptable, 1) Low, 2) Average, 3) High and 4) Exceptional.

41. Vendor evaluation example

Ability to evaluate vendors against each other and over time periods.

42. Evaluating the 2nd-line escalations/interactions from the service desk

  • The following evaluation provides a means of assessing the performance of 2nd-line escalations or interactions from the service desk.
  • The purpose is to provide a "no broken windows" environment, with detail from the bottom up.
  • Each day a random sample of 5 calls is graded by the service desk manager. Each item receives 1 point, up to a maximum of 10 points.
  • The categories graded are classification, description, actions and times.
  • This grading is trended.

43. Example second line evaluation

  • Trending service desk detail

44. Example service desk interaction

  • Trending service desk detail

45. Problem management template

46. Infrastructure team leader dashboard

  • Weekly dashboard to monitor team progress