1. Thinking problem management! Perceptions
Ronald Bartels
2. The problem between IT and business
- The business problem that must be resolved is the containment
of loss that results from IT outages.
- Containment, not reduction, as business is never static: dramatic changes and company expansion can cause a proportional increase in outages.
- Business hears that outages are going to be reduced, complains when this does not happen, and does not discount expansions or acquisitions.
- Business needs to be sold on the fact that if the number of employees doubled and outages increased by only 10%, then that is a good thing.
3. The problem with analogies
- An analogy is like a bucket of water with a hole in it: you can only carry it so far.
4. (image slide)
5. Right eye + Left eye → Stereo → Parallax → Depth perception → Avoid collisions, Accuracy
6. Operations: Process (ITIL), Visibility, Metrics (Dashboards) → Trust: Secure, Available, Cost effective, Efficient and effective
7. What the blind see when using binoculars
8. Questions?
- How do you describe depth to a blind person in terms of
vision?
- How do you describe trust to business in terms of
technology?
9. How do you describe depth to a blind person in terms of
vision?
10. How do you describe trust to business in terms of
technology?
- Non-technology terms (people)
11. Steve Prentice, Gartner
- How computers look is as important as how they perform
12. Operations: ITIL
13. Process in the context of service management?
- People: properly aligned with the organisation and business processes
- Process: transformation to create real value
- Technology: to enable efficient and effective support processes across the enterprise
14. Visibility: Metrics
15. How do you eat an elephant? One bite at a time! The first bite = problem management, with a large amount of chewing on major incidents. A full discussion on IT metrics is a large elephant (in actual fact it is 12 tons).
16. The elephant: Customer satisfaction
- Number of changes submitted
- Number of changes in process (i.e. the backlog)
- Number of changes rejected
- Number of changes implemented
- Number of emergency changes
- Number of unauthorized changes
- Number of changes that exceeded the allowed change window period
- Number of failed changes that did not have a back-out plan
- Loss associated with failed changes
- Number of changes implemented on schedule
- Number of SLAs breached due to a failed change
- Number of changes that failed during installation
- Number of changes that caused an incident
- Number of changes that caused a problem
- Number of inaccurate Configuration Items (CIs), where the production CI doesn't match the CI record
- Number of failed changes due to inaccurate CIs
- Number of incidents caused by inaccurate CIs
- Amount of unplanned work caused by inaccurate CIs
- Number of unused licenses or lack of licenses (important ROI calculation)
- Number of unauthorized CIs (no corresponding RFC)
1, 2 and 3 tons
- Number of releases that conformed to the company's Release Policy
- Number of releases implemented according to schedule
- Number of releases implemented late
- Number of unauthorized CIs in the Definitive Software Library (DSL)
- Number of releases that were not tested according to plan
- Number of emergency releases
17. The elephant: Customer satisfaction?
- Number of calls (phone, fax, email or portal) to the service desk
- Number of calls handled per agent
- Number of calls handled within SLA targets
- Number of calls handled that exceeded SLA targets
- Resolution rate during first contact
- Number of calls escalated due to timing
- Number of calls escalated due to skills required
- Average time the caller waits in queue
- Number of known errors resolved
- Number of RFCs raised by Problem Management
4, 5, 6 and 7 tons
- Number of incidents resolved within SLA targets for each level of priority
- Number of incidents escalated to each level of support
- Average time to resolve incidents by priority
- Number of incidents incorrectly recorded (priority, type, etc.)
- Number of incidents incorrectly escalated to the wrong second-line resource
- Number of services covered by SLAs
- Number of SLAs that do not have required Operating Level Agreements and/or Underpinning Contracts
- Number of SLA targets at risk
- Business impact of breaches
- Number of service complaints
- Number of service reviews conducted
- Number of service reviews outstanding
- Number of service improvement plans (SIPs) opened
- Number of open tasks from SIPs
18. The elephant: Customer satisfaction?
- Service availability expressed using an agreed-upon measure: Availability = Uptime / Time Possible
- Mean Time To Repair (MTTR)
- Mean Time Between Service Incidents (MTBSI)
- Business impact of outages
- Number of services where availability targets were improved, maintained or decreased
- Number of services with a continuity plan
- Number of services without a continuity plan
- Number of continuity plans tested
- Number of continuity plans not tested according to schedule (backlog)
- Number of open issues raised by testing
- Number of plans which are high risk
- Number of plans evaluated as ineffective
- Number of services with unknown capacity requirements
- Unplanned capacity purchases
- Accuracy of capacity plan
- Capacity purchases vs. budgeted amounts
- Number of CIs with performance monitoring
- Actual expenses relative to budget
- Number of services with known costs
- Number of services reviewed per schedule
- Number of services charged by usage
- Amount of IT costs absorbed
- Number of services with a recovery model in testing
- Number of services with a recovery model implemented
8, 9, 10, 11 and 12 tons
- Number of security incidents opened by severity
- Number of security incidents closed by severity
- Number of services that have had security reviews
- Number of security reviews outstanding
- Number of risks identified
- Number of risks mitigated to an acceptable level
19. Problem Management
- To minimize the adverse impacts of incidents and to prevent
recurrence of incidents. Problem Management seeks to get to the
root cause and initiate action to remove the error.
- A problem is the unknown, underlying cause of one or more
incidents.
- A known error is when the root cause of a problem is known and
a temporary workaround or alternative has been identified.
20. Problem management activities
- Assistance with the handling of major incidents and providing quality control
- Problem identification and recording
- Problem investigation and diagnosis
- Error identification and recording
- Recording error resolution
- Monitoring resolution progress
- Proactive prevention of problems
- Providing information to the company
- Obtaining management information from problem data
- Completing major problem reviews
21. Why Problem Management?
- Becoming proactive instead of reactive (fire-fighting)
- Reward proactive problem management, not reactive fault fixing
- Generate honesty, integrity and transparency
- Delivers real business benefit
- Delivers bottom-line benefit from the productive use of IT systems
- Service failures are an avoidable cost: fixing things quickly doesn't really benefit the business; avoidance is worth more
- Requires IT customer engagement
- IT customers should assist in assigning priorities and estimating costs
- Workarounds change business processes
- Creating proactive views results in discussions with IT customers
22. Problem management dashboard
- Service catalogue
- Risk, outage, classification (ROC) area map
- Expanded incident lifecycle
- Time analysis
- Last 10 major incidents
- Heat map
- Top 10 problems
- Problem/Incident breakdown
- Major incident skyline
23. Major Incidents
- What constitutes a Major Incident?
- An incident is any event that is not part of the standard operation of a service and that causes an interruption or a reduction in the quality of that service. Incidents are recorded in a standardized system which is used for documenting and tracking outages and disruptions. A Major Incident is an unplanned or temporary interruption of service with severe negative consequences. Examples are outages involving core infrastructure equipment/services that affect a significant customer base, such as the isolation of a company site. Any equipment or service outage that does not meet the criteria necessary to qualify as a Major Incident is by default a Minor Incident. Major incident reports are escalated to the problem manager for quality assurance.
There is a close relationship between the major incident process and problem management.
24. Major Incident template
25. Incident consequence, grading: taking the blah, blah, blah out of IT.
26. Major Incident process (timeline diagram: Incident → Detected → Diagnosed → Workaround → Repair → Recover → Restore → Resolution, with escalations to problem management, a hot ticket and hot line, service desk notification/feedback to IT customers, ROC analysis, IUM, and plan/declaration as opportunistic, known or potential major)
27. Expanded Incident lifecycle
Crucial times to record for problem management.
28. IUM (Incident User Metric)
- What is the opportunity cost to the company of 1 minute's outage, based on the effect on productivity? (Or, put another way, what is the total salary bill of the company for 1 minute?)
- What was the length of the outage?
- What percentage of the IT customer population was impacted?
- Is there a lesser multiplier? (Liability, scrutiny by management, internal process, company's image)
- Length of outage * population impacted * opportunity cost * (multiplier) = INCIDENT USER METRIC
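The formula above can be sketched as a small function. The figures in the example (outage length, share of staff affected, per-minute salary bill, multiplier) are hypothetical, chosen only to illustrate the calculation:

```python
def incident_user_metric(outage_minutes, population_impacted,
                         opportunity_cost_per_min, multiplier=1.0):
    """Length of outage * population impacted * opportunity cost * (multiplier)."""
    return outage_minutes * population_impacted * opportunity_cost_per_min * multiplier

# Hypothetical figures: a 45-minute outage affecting 30% of staff,
# with a company-wide salary bill of $500 per minute and no extra multiplier.
print(incident_user_metric(45, 0.30, 500))  # 6750.0
```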
How big was it really?
29. IUM example
Large incidents are easily visualized.
30. How does IUM help?
- Most incidents that affect a significant number of IT customers are potential major incidents. What constitutes a major incident and what does not?
- The key is in the IUM. After a large enough sample pool has been built (> 10 incidents), the average is calculated.
- A minor incident is one where the IUM is more than 20% below the norm.
- A major incident is one where the IUM is more than 20% above the norm.
- A normal incident is one where the IUM is within 20% of the norm.
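A minimal sketch of this classification rule, reading "less/greater than 20% of the norm" as 20% below or above the running average:

```python
def classify_incident(ium, norm):
    """Classify an incident by comparing its IUM to the norm
    built from a sample pool of more than 10 incidents."""
    if ium < 0.8 * norm:
        return "Minor"
    if ium > 1.2 * norm:
        return "Major"
    return "Normal"

print(classify_incident(5000, 10000))   # Minor
print(classify_incident(13000, 10000))  # Major
print(classify_incident(11000, 10000))  # Normal
```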
Minor, Normal and Major!
31. ROC Analysis (Risk, Outage, Classification)
- BIA Lite (lightweight business impact analysis)
- SOA Lite (lightweight service outage analysis)
- CRAMM Lite (lightweight risk management)
Full-blown process analysis is time consuming, so a variant is used that is quick and easy.
32. CRAMM Lite
- The asset, process or resources involved in the major incident are measured from a risk perspective. Three areas are assessed. Each area has a maximum score of 4, and the risk is the total score of all areas represented as a percentage.
- CIA (confidentiality, integrity and availability) are scored.
- Loss (C), error (I) and failure (A) are scored.
- Countermeasures already in place, and those that will be implemented in the future, are scored.
The R in ROC analysis
33. Risk example
- The impact is rated as 4 (Critical): Confidentiality = Secure, Integrity = Very high, Availability = Mandatory
- The impact is rated as 4: high loss probability, high error probability, high failure probability
- Countermeasures are rated as 2: service provider due diligence.
- The score is thus 10 out of a maximum of 12 = 83%
34. SOA Lite
- An outage analysis is conducted of the service impacted. Two areas are assessed. Each area has a maximum score of 4, and the service outage is the total score of all areas represented as a percentage.
- The measurement is based on elapsed time.
- Determined by financial means or business perceptions.
The O in ROC analysis
35. Outage example
- The period is rated as 3 (Major): app, server or link (network or voice) unavailable for greater than 1 hour, or degraded for greater than 4 hours
- The consequence is rated as 2 (Moderate): financial loss which impacts the profitability of the business unit (greater than $100k), or embarrassment, or reported to a regulator, or hospitalization
- The score is thus 5 out of a maximum of 8 = 63%
36. BIA Lite
- The resultant impact on the company is measured to determine the customer perspective. Five areas are assessed. Each area has a maximum score of 4, and the classification is the total score of all areas represented as a percentage.
- Percentage of customers affected
- Internal and external negative consequences for the company
The C in ROC analysis
37. Classification example
- The scope is rated as 2: less than 25% of customers affected
- The credibility is rated as 4: areas outside the company will be affected negatively
- The operations is rated as 2: interferes with normal completion of work
- The urgency is rated as 3: caused by unscheduled change or maintenance
- The prioritization is rated as 3 (High): technicians respond immediately, assess the situation, and may interrupt other staff working low- or medium-priority jobs for assistance
- The score is thus 14 out of a maximum of 20 = 70%
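All three ROC components follow the same arithmetic: total the area scores (maximum 4 each) and express the total as a percentage of the maximum possible. A sketch, fed with the three worked examples above (half-up rounding is assumed, which matches the percentages quoted in the examples):

```python
def roc_component(area_scores, max_per_area=4):
    """Total of the area scores as a whole-number percentage
    of the maximum possible score (half-up rounding)."""
    return int(100 * sum(area_scores) / (max_per_area * len(area_scores)) + 0.5)

print(roc_component([4, 4, 2]))        # Risk (CRAMM Lite): 10/12 -> 83
print(roc_component([3, 2]))           # Outage (SOA Lite): 5/8 -> 63
print(roc_component([2, 4, 2, 3, 3]))  # Classification (BIA Lite): 14/20 -> 70
```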
38. How does ROC help?
- After a large enough sample pool has been built, the averages are calculated.
- Let's assume the averages are: Risk = 54%, Outage = 49% and Classification = 70%.
- A new incident happens where the Risk = 68%, Outage = 40% and the Classification = 69%.
- The following statement about the incident can be made: the incident affected the company the same as usual; the outage was less than normal; the risk was greater than average.
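The comparison against the running averages can be sketched as below. The ±5 percentage-point tolerance for "the same as usual" is an assumption for illustration, not a figure from the talk:

```python
def compare_to_norm(value, norm, tolerance=5):
    """Describe a ROC component relative to its running average
    (tolerance is in percentage points and is an assumed value)."""
    if value > norm + tolerance:
        return "greater than average"
    if value < norm - tolerance:
        return "less than normal"
    return "the same as usual"

norms = {"Risk": 54, "Outage": 49, "Classification": 70}
incident = {"Risk": 68, "Outage": 40, "Classification": 69}
for component, norm in norms.items():
    print(component, "was", compare_to_norm(incident[component], norm))
```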
The meaning of ROC
39. Grading the resources used during a major incident
- Identification and business impact: have the resources correctly identified the major incident and described what happened in the correct level of detail? Has the correct impacted service been identified from the service catalogue? Was the business impact obtained or measured?
- Conditions: what were the business, IT or environmental conditions present during the incident, and did the resources describe these to a suitable level of detail?
- Expanded incident lifecycle: are all the times in the expanded incident lifecycle recorded, and are they realistic? Were these recorded in the incident reference at the service desk?
- Resolution/workaround: how suitable was the resolution, and was a workaround implemented to reduce the time the service was unavailable?
- Classification: have the resources correctly classified the impact to the company, and was the incident handled with the correct level of prioritization?
- Outage: have the resources recorded and classified the outage times correctly?
- Risk: has a suitable risk assessment of the service, asset and process been conducted?
- Escalations/communications: did the resources escalate the incident, and was communication during the process suitable?
The maximum possible score is 32, and the grading is calculated by totalling up the scores from the 8 different areas and representing the total as a percentage of the maximum.
40. Vendor management: scoring a vendor's performance
- Assessment of service and support: responds to support calls on first contact, resolves issues during support calls, reports defects before the customer discovers them, provides patches and fixes that work, and provides accurate documentation.
- Assessment of account team: provides clear points of contact for sales, support and service related issues, responds to questions in a timely fashion, understands the company's specific needs and requests.
- Assessment of technical quality and innovation: delivers products that perform as expected, delivers maintenance services that meet requirements, communicates information regarding new product timeframes, and provides unique technologies, features and capabilities.
- Assessment of procurement, billing and reconciliation: meets process deadlines, meets requested customization to product or service, estimates statements of work accurately, delivers accurate invoices, and resolves invoice discrepancies quickly.
The vendor score consists of four sections, each with a maximum score of 4. The maximum possible score is thus 16, and the vendor score is represented as a percentage of this maximum possible score. Each section is scored as follows: 0) Unacceptable, 1) Low, 2) Average, 3) High and 4) Exceptional.
41. Vendor evaluation example
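The vendor scoring described above can be sketched as follows; the four section values in the example are hypothetical:

```python
# 0-4 scale from the vendor management slide.
SCALE = {0: "Unacceptable", 1: "Low", 2: "Average", 3: "High", 4: "Exceptional"}

def vendor_score(section_scores):
    """Four sections, each scored 0-4; the result is a percentage
    of the maximum possible score of 16."""
    assert len(section_scores) == 4 and all(0 <= s <= 4 for s in section_scores)
    return round(100 * sum(section_scores) / 16)

# Hypothetical vendor: service/support = 3, account team = 2,
# technical quality/innovation = 4, procurement/billing = 2.
print(vendor_score([3, 2, 4, 2]))  # 69
```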
Ability to evaluate vendors against each other and over time periods.
42. Evaluating second-line escalations/interactions from the service desk
- The following evaluation provides a means of assessing the performance of second-line escalations or interactions from the service desk.
- The purpose is to provide a "no broken windows" environment, with detail from the bottom up.
- Each day a random sample of 5 calls is graded by the Service Desk manager. Each item receives 1 point, up to a maximum of 10 points.
- The categories graded are classification, description, actions and times.
43. Example second line evaluation
- Trending service desk detail
44. Example service desk interaction
- Trending service desk detail
45. Problem management template
46. Infrastructure team leader dashboard
- Weekly dashboard to monitor team progress