Managing a Major Incident

DESCRIPTION

Presented by Ms Mayda Lim, Head of Implementation and Support, Thomson Reuters, at NUS-ISS ITSM CoP on 24 Apr 2014.

Page 1: Managing a Major Incident

Managing a Major Incident
Case Study in Thomson Reuters Realtime Technology Operations

Mayda Lim
Head of Implementation & Support, Technology Operations
24 April 2014
Version: Final

Page 2: Managing a Major Incident

AGENDA

• Thomson Reuters

• Our Service Management Journey

• Managing a Major Incident

Page 3: Managing a Major Incident

INTRODUCTION Thomson Reuters

Page 4: Managing a Major Incident

INTRODUCTION Thomson Reuters

Trading

Investors

Marketplaces

Governance Risk & Compliance

Large Law Firms

Small Law Firms

General Counsels

Government

Intellectual Property

Scientific & Scholarly Research

Life Sciences

Legal, Financial & Risk, IP & Science, Tax & Accounting

Corporate

Professional

Knowledge Solutions

Government

Reuters News

Media

Page 5: Managing a Major Incident

INTRODUCTION Finance & Risk

• We serve more than 40,000 customers and 400,000 end users in over 155 countries with a strong presence in North America and Europe and a growing presence in emerging markets. At least 50,000 customer applications also use our information

• Our customers include energy companies, investment management firms, brokerage houses, industrial conglomerates, the world’s top corporations and the 25 largest global banks

• We have the number 1 or 2 position in every segment we serve. We anticipate strongest growth in Governance Risk & Compliance, Commodities & Energy, Marketplaces, Transactions and Enterprise Content, Buy-side and Corporations and Global Markets.

Page 6: Managing a Major Incident

Real-Time Technology Infrastructure

• Probably the largest private commercial network in the world, delivering news & content to desktops and trading applications across 155 countries

• Connecting to ~250 exchanges with over 7,000 non-exchange sources

• 350,000 customer end-points across 50,000 customer sites

• 10 million Reuters Instrument Codes in its head-end database, with Editorial and 3rd party news providing >50,000 stories a day

• Delivery options available, depending upon content, latency needs and geography

• 2.6 Million updates per second

• Real-time, mission critical traffic

[Diagram labels: Exchanges, News, Contributors, Client]

Page 7: Managing a Major Incident

Resilience
◘ Resilience is a key aspect of our design, development and builds
◘ At the shared infrastructure level and at the Service level

Service
◘ Dual System Installations
    Automated switching
◘ Dual Power
    Either at server/device level or through the use of Power finders
◘ Dual International Communications lines
    Utilising multiple Telecom providers
◘ Dual Illumination
    Dual Uplinks
    Dual Receivers

[Diagram: Network Resiliency Topology, with Data, SDC and Client nodes]

Page 8: Managing a Major Incident

Service Management Our Journey in Service Management

Page 9: Managing a Major Incident

SERVICE MANAGEMENT Our Journey in Service Management

Programme’s Transformation Objectives laid out in 2004:

• An organisation with a customer oriented proactive culture

• A full implementation of appropriate ITIL Service Management processes in line with business requirements

• Staff fully trained and motivated to provide great customer service

• An integrated tool set to provide seamless end to end processing

• A single managed source of reliable trusted data

Improve Optimize Automate

Page 10: Managing a Major Incident

SERVICE MANAGEMENT ITIL Processes Adoption

Process: Incident
Detail:
◘ Severity levels, prioritization framework, escalation procedures, improved data capture, improved customer communications
◘ Standard Process, Standard tools, Process governance & roles in place
Status: Complete

Process: Problem
Detail:
◘ Problem classifications, root cause analysis process and problem database
◘ Standard Process, Standard tools, Process governance & roles in place
Status: Complete

Process: Change
Detail:
◘ Improved risk assessment and reporting, enhanced alignment with assets database
◘ Standard Process, Standard tool, Process governance & roles in place
Status: Complete

Process: Release
Detail:
◘ Release policy, standardization of release documentation templates and guidelines, improved resource management via Forward Schedule of Release
◘ Standard Process, Process governance & roles in place
Status: Complete

Process: Capacity
Detail:
◘ Systems under watch increased, capacity risk dashboard developed
◘ Standard Process, Standard tool, Process governance & roles in place
Status: Complete

Page 11: Managing a Major Incident

SERVICE MANAGEMENT ITIL Processes Adoption

Process: Configuration
Detail:
◘ SM tools rollout following a complete audit; process supported by Change Management
◘ Standard Process, Standard tools, Process governance & roles in place
Status: Complete

Process: Financial
Detail:
◘ Technology operations is fully aligned to the business
◘ Accountability of CTO
Status: Complete

Process: Service Level
Detail:
◘ Central Sourcing function
◘ Back to back internal and external SLA
◘ Service Target agreed
Status: Complete

Process: Knowledge
Detail:
◘ Formal Process defined and mapped to tool
◘ Commissioned since Feb 2009
Status: Complete

Process: Business Continuity
Detail:
◘ Comprehensive documentation
◘ Perform regular exercises
◘ Reviews and Updates
Status: Complete

Page 12: Managing a Major Incident

SERVICE MANAGEMENT Tools

Service Manager
◘ Consolidated Service Desk solution providing best practices based on industry standards
◘ Incident Management
◘ Problem Management
◘ Inventory & Configuration Mgt
◘ Change Management
◘ Scheduled Maintenance
◘ Request Management
◘ Service Level Agreement Mgt
◘ Contract Management
◘ Diagnostic Aids

AssetCenter
◘ Asset Management solution providing the greatest depth of procurement, inventory, financial and contract management functionality
◘ Portfolio
◘ Procurement
◘ Financials
◘ Cable & Circuit
◘ Contracts
◘ AssetCenter Web

ITIL Ready Tools
◘ ITIL processes in their own right can progress an organisation’s maturity and performance; when you couple them with an ITIL-ready toolset, major improvements can be noted
◘ An integrated toolset ensures clear process flows, consistency and efficiency

Page 13: Managing a Major Incident

Managing a Major Incident Incident Control Centre (ICC)

Page 14: Managing a Major Incident

What is a Major Incident?

An incident is considered Major when

• there is a complete or partial service failure (unavailability)

• impact on business is extreme
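As a rough illustration, the two criteria above could be encoded as a simple check. This is a minimal sketch, not the presenter's tooling: the `Incident` record, its field names and the string values are hypothetical, and because the slide does not say whether the criteria combine with AND or OR, the `require_both` flag makes that assumption explicit (OR is assumed by default).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    summary: str
    service_failure: str   # assumed values: "none", "partial" or "complete"
    business_impact: str   # assumed values: e.g. "low", "high", "extreme"


def is_major(incident: Incident, require_both: bool = False) -> bool:
    """Apply the two Major Incident criteria from the slide.

    The slide does not state whether the criteria combine with AND or OR,
    so `require_both` makes the choice explicit (assumed default: OR).
    """
    service_failed = incident.service_failure in ("partial", "complete")
    extreme_impact = incident.business_impact == "extreme"
    if require_both:
        return service_failed and extreme_impact
    return service_failed or extreme_impact


outage = Incident("Feed unavailable in one region", "partial", "high")
print(is_major(outage))                     # True under the OR reading
print(is_major(outage, require_both=True))  # False under the AND reading
```

Keeping the AND/OR choice as a parameter means the check can be aligned with whichever reading the organisation's own severity framework uses.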

Page 15: Managing a Major Incident

What Is An Incident Control Centre (ICC)?

WHAT

• Process called to manage Major Incidents

• A focal point accountable for coordinating efforts, ensuring clear and concise customer communication

ACTIONS

• Communicate with all relevant stakeholders

• Communicate effectively and professionally to our customers

• Escalate to the Management team as appropriate

• Coordinate diagnosis and recovery

• Prioritize key activities

• Continuously analyze and minimize service restoration timeframes

• Manage all technical recovery activities through the IRT

• Outline resourcing and escalation

• Undertake risk and impact assessments

• Determine follow-up actions

Page 16: Managing a Major Incident

So, what does the typical life-cycle of an ICC look like?

Page 17: Managing a Major Incident

ICC Attributes

• The ICC operates on a 24 x 7 x 365 basis

• It is essential to escalate appropriately at all times day or night

Page 18: Managing a Major Incident

The Benefits Of ICC

• Customer focused

• Consistent approach and methodology

• Effective communication

• Appropriate resource is guided and focused

• Manages Risks associated with Major Incidents

Page 19: Managing a Major Incident

What Can Go Wrong If An ICC Is Not Called?

• Increased customer pain

• Increased brand damage

• Poorly or incorrectly understood Incidents

• Inappropriate and indeed harmful actions may be initiated

• Poor or no coordination of resources

• Incorrect prioritization

• Poor or no communication

• Inconsistency in approach, management, actions and output

• In simple terms – The situation escalates and creates more damage and pain

Page 20: Managing a Major Incident

ICC Process

Page 21: Managing a Major Incident

ICC DEFCON Levels

Defense readiness condition (DEFCON)

Page 22: Managing a Major Incident

Service Alert & Notification

Internal Communication External Communication

Page 23: Managing a Major Incident

Key Roles

Incident Recovery Team (IRT)

Management Team Meeting (MTM)

Incident Management Group (IMG)

Page 24: Managing a Major Incident

ICC Team Layout

Page 25: Managing a Major Incident

Incident Recovery Team (IRT)

WHEN

• Whenever an incident’s severity is upgraded to a DEFCON level

• Service-impacting incident with an unclear recovery path

ROLE

• The IRT are responsible for all technical recovery activities

• It is the IRT’s role to provide and drive the ‘technical solution’

• The team is created at the request of the Incident Manager / Technical Recovery Manager

• The Technical Recovery Manager will appoint an Incident Recovery Team Lead (IRTL)

• Membership will vary depending upon the nature of the Incident, but will typically have an Incident Recovery Leader and a number of subject matter experts

• The IRTL can change or supplement the team membership

• The IRT meeting will remain open until service is restored

Page 26: Managing a Major Incident

Incident Management Group (IMG)

WHEN

• An IMG meeting is called for all DEFCON levels within 30 minutes of an ICC being initiated

• Meetings will occur hourly thereafter although the frequency can be adjusted with agreement from the Incident Manager

• The IMG will last for no longer than 20 minutes and will be based in the ‘War Room’

ROLE

• Act as a focal point for communication to ensure effective and professional communication occurs

• Coordinate the activities of the most appropriate staff and teams solving the Incident – ensuring that the IRT has the right skills and leadership in place and that progress is being made as effectively as possible

• The IMG can suggest and make membership changes to the Incident Recovery Team as they feel appropriate (DEFCON 2 and above)

• It is not the IMG’s role to drop into ‘technical solution’ mode – this is the responsibility of the Incident Recovery Team
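The IMG timing rules above (first meeting within 30 minutes, hourly thereafter, no longer than 20 minutes) can be sketched as a small schedule calculation. This is an illustrative sketch only: the function name and parameters are hypothetical, and it assumes the first meeting starts exactly at the 30-minute deadline, with later meetings spaced from that start.

```python
from datetime import datetime, timedelta

FIRST_IMG_WITHIN = timedelta(minutes=30)   # "within 30 minutes of an ICC being initiated"
DEFAULT_INTERVAL = timedelta(hours=1)      # "hourly thereafter"
MAX_DURATION = timedelta(minutes=20)       # "no longer than 20 minutes"


def img_schedule(icc_initiated: datetime, meetings: int,
                 interval: timedelta = DEFAULT_INTERVAL) -> list[tuple[datetime, datetime]]:
    """Return (latest start, latest end) for the first `meetings` IMG meetings.

    Assumption: the first meeting starts exactly at the 30-minute deadline and
    later meetings are spaced `interval` apart from that start.
    """
    first_start = icc_initiated + FIRST_IMG_WITHIN
    schedule = []
    for i in range(meetings):
        start = first_start + i * interval
        schedule.append((start, start + MAX_DURATION))
    return schedule


for start, end in img_schedule(datetime(2014, 4, 24, 9, 0), meetings=3):
    print(f"{start:%H:%M} to {end:%H:%M}")
# 09:30 to 09:50
# 10:30 to 10:50
# 11:30 to 11:50
```

The `interval` parameter reflects the note that the frequency can be adjusted with agreement from the Incident Manager.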

Page 27: Managing a Major Incident

Management Team Meeting (MTM)

WHEN

• Held for ‘Full ICC DEFCON 2’ or ‘Severity 0 DEFCON 1’

• Follows an IMG meeting within 60 minutes of the ICC being initiated

• Subsequent meeting times will be agreed with the Incident Manager but may typically occur hourly thereafter (only hourly if the incident escalates to DEFCON 1 (Emergency Management Committee))

• The MTM will last for no longer than 20 minutes

ROLE

• Ensure communication to both customers and senior managers is maintained

• Make decisions based upon information provided by the IMG, providing support and guidance as appropriate, and whenever necessary escalate to the EMC
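Taken together, the IMG and MTM slides imply a mapping from DEFCON level to the meetings that must be convened and their deadlines. The sketch below is one hedged reading of that mapping: the function name is hypothetical, and it assumes DEFCON levels are positive integers with 1 the most severe, since the slide defining the levels did not survive in this transcript.

```python
from datetime import timedelta

IMG_DEADLINE = timedelta(minutes=30)   # IMG: within 30 minutes, all DEFCON levels
MTM_DEADLINE = timedelta(minutes=60)   # MTM: within 60 minutes of ICC initiation
MTM_LEVELS = {1, 2}                    # MTM held for DEFCON 2 or DEFCON 1 only


def required_meetings(defcon_level: int) -> dict[str, timedelta]:
    """Map a DEFCON level to the meetings due and their deadlines.

    Assumption: levels are positive integers and 1 is the most severe;
    the full list of levels is not given in the source.
    """
    if defcon_level < 1:
        raise ValueError("DEFCON level must be a positive integer")
    meetings = {"IMG": IMG_DEADLINE}
    if defcon_level in MTM_LEVELS:
        meetings["MTM"] = MTM_DEADLINE
    return meetings


for level in (3, 2, 1):
    minutes = {name: int(d.total_seconds() // 60)
               for name, d in required_meetings(level).items()}
    print(level, minutes)
# 3 {'IMG': 30}
# 2 {'IMG': 30, 'MTM': 60}
# 1 {'IMG': 30, 'MTM': 60}
```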

Page 28: Managing a Major Incident

Stand Down The ICC

• Clear path of recovery

• Service restored

• A conscious and recorded decision will be made to stand down all ICCs

• The Service Alert must be updated to reflect the fact that the ICC has been closed

• Root Cause Analysis (RCA) will be initiated by Problem Management
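The stand-down points above lend themselves to a simple checklist. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes all five points are tracked together, although the slide may intend the first two as preconditions and the rest as follow-up actions.

```python
from dataclasses import dataclass, fields


@dataclass
class StandDownChecklist:
    """Hypothetical checklist mirroring the five stand-down points."""
    recovery_path_clear: bool = False
    service_restored: bool = False
    decision_recorded: bool = False
    service_alert_updated: bool = False
    rca_initiated: bool = False

    def outstanding(self) -> list[str]:
        """Names of the items not yet ticked, in slide order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def complete(self) -> bool:
        """True once every item has been ticked."""
        return not self.outstanding()


checklist = StandDownChecklist(recovery_path_clear=True,
                               service_restored=True,
                               decision_recorded=True)
print(checklist.outstanding())  # ['service_alert_updated', 'rca_initiated']
print(checklist.complete())     # False
```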

Page 29: Managing a Major Incident

ANY QUESTIONS?

Page 30: Managing a Major Incident

Connect with me: sg.linkedin.com/pub/mayda-lim/6/a46/b81/ and @MaydaLim