Presented by Ms Mayda Lim, Head of Implementation and Support, Thomson-Reuters at NUS-ISS ITSM CoP on 24 Apr.
Managing a Major Incident Case Study in Thomson Reuters Realtime Technology Operations
Mayda Lim, Head of Implementation & Support, Technology Operations
24 April 2014
Version: Final
AGENDA
• Thomson Reuters
• Our Service Management Journey
• Managing a Major Incident
INTRODUCTION Thomson Reuters
Financial & Risk: Trading, Investors, Marketplaces, Governance Risk & Compliance
Legal: Large Law Firms, Small Law Firms, General Counsels, Government
IP & Science: Intellectual Property, Scientific & Scholarly Research, Life Sciences
Tax & Accounting: Corporate, Professional, Knowledge Solutions, Government
Reuters News: Media
INTRODUCTION Finance & Risk
• We serve more than 40,000 customers and 400,000 end users in over 155 countries, with a strong presence in North America and Europe and a growing presence in emerging markets. At least 50,000 customer applications also use our information.
• Our customers include energy companies, investment management firms, brokerage houses, industrial conglomerates, the world’s top corporations and the 25 largest global banks.
• We have the number 1 or 2 position in every segment we serve. We anticipate strongest growth in Governance Risk & Compliance, Commodities & Energy, Marketplaces, Transactions and Enterprise Content, Buy-side and Corporations and Global Markets.
Real-Time Technology Infrastructure
• Probably the largest private commercial network in the world, delivering news & content to desktops and trading applications across 155 countries
• Connecting to ~250 exchanges and over 7,000 non-exchange sources
• 350,000 customer end-points across 50,000 customer sites
• 10 million Reuters Instrument Codes in its head-end database, with Editorial and 3rd party news providing >50,000 stories a day
• Delivery options available, depending on content, latency needs and geography
• 2.6 Million updates per second
• Real-time, mission critical traffic
[Diagram labels: Exchanges, News, Contributors; Data feeds; SDC nodes; Clients]
Resilience
◘ Resilience is a key aspect of our design, development and builds
◘ At the shared infrastructure level and at the Service level

Service
◘ Dual System Installations: automated switching
◘ Dual Power: either at server/device level or through the use of Power finders
◘ Dual International Communications lines: utilising multiple Telecom providers
◘ Dual Illumination: dual uplinks, dual receivers
Network Resiliency Topology
Service Management Our Journey in Service Management
Programme’s Transformation Objectives laid out in 2004:
• An organisation with a customer oriented proactive culture
• A full implementation of appropriate ITIL Service Management processes in line with business requirements
• Staff fully trained and motivated to provide great customer service
• An integrated tool set to provide seamless end to end processing
• A single managed source of reliable trusted data
Improve Optimize Automate
SERVICE MANAGEMENT ITIL Processes Adoption
Process: Incident (Status: Complete)
◘ Severity levels, prioritization framework, escalation procedures, improved data capture, improved customer communications
◘ Standard Process, Standard tools & Process governance & roles in place

Process: Problem (Status: Complete)
◘ Problem classifications, root cause analysis process and problem database
◘ Standard Process, Standard tools, Process governance & roles in place

Process: Change (Status: Complete)
◘ Improved risk assessment and reporting, enhanced alignment with assets database
◘ Standard Process, Standard tool, Process governance & roles in place

Process: Release (Status: Complete)
◘ Release policy, standardization of release documentation templates and guidelines, improved resource management via Forward Schedule of Release
◘ Standard Process, Process governance & roles in place

Process: Capacity (Status: Complete)
◘ Systems under watch increased, capacity risk dashboard developed
◘ Standard Process, Standard tool, Process governance & roles in place
Process: Configuration (Status: Complete)
◘ SM tools rollout following a complete audit, process supported by Change Management
◘ Standard Process, Standard tools & Process governance & roles in place

Process: Financial (Status: Complete)
◘ Technology operations is fully aligned to business
◘ Accountability of CTO

Process: Service Level (Status: Complete)
◘ Central Sourcing function
◘ Back to back internal and external SLA
◘ Service Target agreed

Process: Knowledge (Status: Complete)
◘ Formal Process defined and mapped to tool
◘ Commissioned since Feb 2009

Process: Business Continuity (Status: Complete)
◘ Comprehensive Documentation
◘ Perform regular exercises
◘ Reviews and Updates
SERVICE MANAGEMENT Tools
Service Manager
◘ Consolidated Service Desk solution providing best practices based on industry standards
◘ Incident Management
◘ Problem Management
◘ Inventory & Configuration Mgt
◘ Change Management
◘ Scheduled Maintenance
◘ Request Management
◘ Service Level Agreement Mgt
◘ Contract Management
◘ Diagnostic Aids

AssetCenter
◘ Asset Management solution providing the greatest depth of procurement, inventory, financial and contract management functionality
◘ Portfolio
◘ Procurement
◘ Financials
◘ Cable & Circuit
◘ Contracts
◘ AssetCenter Web

ITIL Ready Tools
◘ ITIL processes in their own right can progress an organisation’s maturity and performance. Coupled with an ITIL-ready toolset, major improvements can be noted
◘ An integrated toolset ensures clear process flows, consistency and efficiency
Managing a Major Incident Incident Control Centre (ICC)
What is a Major Incident?
An incident is considered Major when:
• there is a complete or partial service failure (unavailability)
• impact on business is extreme
What Is An Incident Control Centre (ICC)?
WHAT
• Process called to manage Major Incidents
• A focal point accountable for coordinating efforts, ensuring clear and concise customer communication

ACTIONS

• Communicate with all relevant stakeholders
• Communicate effectively and professionally to our customers
• Escalate to the Management team as appropriate
• Coordinate diagnosis and recovery
• Prioritize key activities
• Continuously analyze and minimize service restoration timeframes
• Manage all technical recovery activities through the IRT
• Outline resourcing and escalation
• Undertake risk and impact assessments
• Determine follow-up actions
So, what does the typical life cycle of an ICC look like?
ICC Attributes
• The ICC operates on a 24 x 7 x 365 basis
• It is essential to escalate appropriately at all times day or night
The Benefits Of ICC
• Customer focused
• Consistent approach and methodology
• Effective communication
• Appropriate resource is guided and focused
• Manages Risks associated with Major Incidents
What Can Go Wrong If An ICC Is Not Called?
• Increased customer pain
• Increased brand damage
• Poorly or incorrectly understood Incidents
• Inappropriate and indeed harmful actions may be initiated
• Poor or no coordination of resources
• Incorrect prioritization
• Poor or no communication
• Inconsistency in approach, management, actions and output
• In simple terms – The situation escalates and creates more damage and pain
ICC Process
ICC DEFCON Levels
Defense readiness condition (DEFCON)
Service Alert & Notification
Internal Communication External Communication
Key Roles
Incident Recovery Team (IRT)
Management Team Meeting (MTM)
Incident Management Group (IMG)
ICC Team Layout
Incident Recovery Team (IRT)
WHEN
• Whenever an incident's severity is upgraded to a DEFCON level
• Service-impacting incident with an unclear recovery path
ROLE
• The IRT are responsible for all technical recovery activities
• It is the IRT’s role to provide and drive the ‘technical solution’
• The team is created at the request of the Incident Manager / Technical Recovery Manager
• The Technical Recovery Manager will appoint an Incident Recovery Team Lead (IRTL)
• Membership will vary depending upon the nature of the Incident, but will typically have an Incident Recovery Leader and a number of subject matter experts
• The IRTL can change or supplement the team membership
• The IRT meeting will remain open until service is restored
Incident Management Group (IMG)
WHEN
• An IMG meeting is called for all DEFCON levels within 30 minutes of an ICC being initiated
• Meetings will occur hourly thereafter although the frequency can be adjusted with agreement from the Incident Manager
• The IMG will last for no longer than 20 minutes and will be based in the ‘War Room’
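The IMG timing rules above can be sketched as a small schedule calculation. This is only an illustration: the function and constant names are hypothetical, and it assumes the first meeting starts exactly at the 30-minute deadline with later meetings at a fixed interval from that start.

```python
from datetime import datetime, timedelta

# Figures taken from the slide; all names here are hypothetical.
FIRST_IMG_DEADLINE = timedelta(minutes=30)  # first IMG within 30 minutes of ICC initiation
DEFAULT_INTERVAL = timedelta(hours=1)       # hourly thereafter unless the Incident Manager agrees otherwise
MAX_IMG_DURATION = timedelta(minutes=20)    # an IMG lasts no longer than 20 minutes


def img_schedule(icc_initiated, count, interval=DEFAULT_INTERVAL):
    """Return (latest start, latest end) pairs for the first `count` IMG meetings.

    Assumption: the first meeting starts at the 30-minute deadline and each
    later meeting starts one `interval` after the previous one.
    """
    first_start = icc_initiated + FIRST_IMG_DEADLINE
    schedule = []
    for i in range(count):
        start = first_start + i * interval
        schedule.append((start, start + MAX_IMG_DURATION))
    return schedule


if __name__ == "__main__":
    initiated = datetime(2014, 4, 24, 9, 0)
    for start, end in img_schedule(initiated, 3):
        print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
```

Passing a different `interval` models the case where the Incident Manager agrees to adjust the meeting frequency.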
ROLE
• Act as a focal point for communication to ensure effective and professional communication occurs
• Coordinate the activities of the most appropriate staff and teams solving the Incident – ensuring that the IRT has the right skills and leadership in place and that progress is being made as effectively as possible
• The IMG can suggest and make membership changes to the Incident Recovery Team as they feel appropriate (DEFCON 2 and above)
• It is not the IMG’s role to drop into ‘technical solution’ mode – this is the responsibility of the Incident Recovery Team
Management Team Meeting (MTM)
WHEN
• Held for ‘Full ICC DEFCON 2’ or ‘Severity 0 DEFCON 1’
• Follows an IMG meeting within 60 minutes of the ICC being initiated
• Subsequent meeting times will be agreed with the Incident Manager but may typically occur hourly thereafter (only hourly if the incident escalates to DEFCON 1 (Emergency Management Committee))
• The MTM will last for no longer than 20 minutes
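As a rough sketch, the MTM trigger and first-meeting deadline could be expressed as below. The names are hypothetical, and representing DEFCON levels as integers (with levels other than 1 and 2 existing) is an assumption, since the slide defining the levels is not reproduced in the text.

```python
from datetime import datetime, timedelta

# Values from the slide; names and the integer encoding of levels are assumptions.
MTM_DEFCON_LEVELS = {1, 2}              # 'Full ICC DEFCON 2' or 'Severity 0 DEFCON 1'
MTM_DEADLINE = timedelta(minutes=60)    # within 60 minutes of the ICC being initiated
MTM_MAX_DURATION = timedelta(minutes=20)


def mtm_required(defcon_level):
    """Return True if a Management Team Meeting is held at this DEFCON level."""
    return defcon_level in MTM_DEFCON_LEVELS


def first_mtm_deadline(icc_initiated, defcon_level):
    """Return the latest start time of the first MTM, or None if no MTM is held."""
    if not mtm_required(defcon_level):
        return None
    return icc_initiated + MTM_DEADLINE
```

For an ICC initiated at 09:00 at DEFCON 2, `first_mtm_deadline` gives 10:00, which leaves room for the first IMG (due by 09:30) to take place beforehand.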
ROLE
• Ensure communication to both customers and senior managers is maintained
• Make decisions based upon information provided by the IMG, providing support and guidance as appropriate, and whenever necessary escalate to the EMC
Stand Down The ICC
• Clear path of recovery
• Service restored
• A conscious and recorded decision will be made to stand down all ICCs
• The Service Alert must be updated to reflect the fact that the ICC has been closed
• Root Cause Analysis (RCA) will be initiated by Problem Management
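One way to track the stand-down points above is a simple checklist that reports what is still outstanding. This is a minimal sketch with hypothetical names; it treats each bullet as a separate item to tick off and does not decide which combination of items is sufficient, since the slide does not say.

```python
from dataclasses import dataclass, fields


@dataclass
class StandDownChecklist:
    """One flag per stand-down bullet on the slide (field names are hypothetical)."""
    clear_path_of_recovery: bool = False
    service_restored: bool = False
    decision_recorded: bool = False       # conscious and recorded decision to stand down
    service_alert_updated: bool = False   # Service Alert reflects that the ICC is closed
    rca_initiated: bool = False           # Root Cause Analysis started by Problem Management


def outstanding_items(checklist):
    """Return the names of checklist items not yet ticked, in slide order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]
```

For example, `outstanding_items(StandDownChecklist(service_restored=True))` lists the four remaining items, which an Incident Manager could review before recording the stand-down decision.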
ANY QUESTIONS
Connect me @ sg.linkedin.com/pub/mayda-lim/6/a46/b81/ @MaydaLim