Business Continuity Management for Data Centres
Dr Robert M Cachia BSc Dott Sc (Milan) FCQI (UK)
Director, ISACA Malta Chapter
IT Governance Manager, Government of Malta
Visiting Senior Lecturer, University of Malta
Joint Event: BCS Malta Section, ISACA Malta Chapter
13th October 2009

About myself

- 20+ years experience
- Positions: Programmer, Software Engineer, IT Project Manager, IT Quality Manager, Principal Consultant, IT Governance Manager
- IT experience:
  - Software House environment
  - IT Service Desk environment
  - Data Centre environment
- University of Malta, teaching (to MBA/B.Accty, Dipl. Logistics & Transportation students):
  - e-SCM, ERP, e-CRM
  - Service Quality Assurance
  - Sustainable Logistics & Transportation
- Director, ISACA Malta Chapter

robert.m.cachia "at" um.edu.mt

Today's agenda (1)

BCM for industrial strength, co-located, commercial IT providers, e.g. Data Centres offering SaaS using Cloud Computing technologies

Today we discuss:
- e-services, e-business, data centres & IT call centres
- BCM for Data Centres
- business, e-services, & trust
- BCM & IT - economic context
- threats, vulnerabilities, risk, assets
- Data Centre outages: business impacts
- BCM - in Data Centres, for Data Centres

Today's agenda (2)

Today we discuss (continued):
- a BCM initiative for DCs: its output
- BCP for DCs: management choices
- BCM Project/Programme
- business impact analysis
- risk assessment
- BCM DC recovery - business & technical
- an example disaster - mass cyber-attack, & BCM response
- structure and content of BS25999
- some takeaways

e-Services, e-Business - all from Data Centres & IT Call Centres

BCM for Data Centres: what?

BCM
i. is a management & business capability to protect value-creating activities of the DC
ii. supports the Board's objective of: staying in business
iii. points i & ii make BCM immediately a Governance issue

BCM - planned processes to:
- assess, reduce & manage vulnerability to risk
- plan DC responses in case of adversity
- exercise preparedness for continued IT delivery during disruption
- restore normality, and protect DC reputation and goodwill

"The time to repair the roof is when the sun is shining" (J F Kennedy)

Business, e-Services, & trust

Is she: a Logistics Manager, a Banker, a PA to the CEO, a CEO?
in: an Airport, a Bourse, a Telecoms provider, iGaming, e-payment, a Retail chain ...
.. doing B2B or B2C?
.. doing e-CRM or e-SCM?
e-services need trust

BCM and IT - economic context

- Economy 2.0 & Economy 3.0: digital, networked, global, mobile, locational, 24x7x365
- 1.1 billion Internet users, 2.7 billion mobile phone users
- ubiquitous VoIP/Skype, e-mail, & calendaring software
- e-Business, e-Banking, e-Payment, e-Government
- e-SCM, e-CRM, B2B, B2C - across supply chains, geographies, time-zones, economies & jurisdictions

BCM in DCs is mandated by economic logic

BCM and IT - what not to do

- human error: IT management
- human error: technical

DC BCM terminology: Threats, Vulnerabilities, Risk, Assets

Threat: the intent and capacity to cause loss or disruption and create adverse consequences - e.g. to IT services, data, Data Centres, IT Call Centres etc.
Vulnerability: the susceptibility of a service provider, service, data or infrastructure to damage, impairment or exposure by a threat
Risk: a measure of the potential consequences of a contingency against the likelihood of its occurring; threats + vulnerabilities = risk
Impact: the consequence/effect of a threat expressed in terms of reduction of DC capability, or loss of business, service, data, etc.
Asset: an IT service element, e.g. software, hardware, IT people, data

Threats - some examples

Some data centre threats:
- human error (management/technical, accidental/malicious)
- technical failure (mechanical, electrical, hardware, software, etc.)
- software failure (even software patches themselves contain bugs!)
- fire / explosion / smoke
- floods (natural, burst pipes)
- toxic hazards (effluents, emissions)
- structural collapse
- crime (vandalism, organized crime, white-collar crime, etc.)
- mass cyber-attack, digital terrorism, cyber-war

Established threats may diminish, and new threats emerge - alertness

[Photo: Oct 2006 - Marsascala]

Threats - general, & IT-specific

[Photos: explosion, fire, smoke, emissions, effluents in or close to DC; malicious IT threat]

IT Vulnerabilities - some examples (1)

DC vulnerabilities may be technical or non-technical, e.g. in organizational design, in business processes, systems, or software

Some IT vulnerabilities:
- Poor Data Centre site selection, or Data Centre design
- SPOFs (people, hardware, databases, software)
- Poor Data Centre procedures, or compliance to procedure (SFIA HR framework for IT, ITIL)
- Poor physical access control
- Poor software patching

[Photo: good physical access control]

IT Vulnerabilities - some examples (2)

- Poor software testing, especially Web testing (recall CMMi, ISEB)
- Poor Configuration Management/Version Control (recall ISO 20000, ITIL)
- Poor Infosec (passwords, biometrics, encryption, anti-virus/malware) (recall ISO 27001, CobIT, RiskIT)
- Poor Capacity Management/Availability Management (recall ISO 20000, ITIL)
- Poor procurement/supplier management (ISO 9000, ITIL, BS25999)
- Poor fire protection

Known vulnerabilities sometimes diminish & new ones emerge - alertness

[Photo: wiring from hell]

Risks - general & IT-specific

Some Data Centre risks:
- loss of facilities (buildings, roads leading to building, etc.)
- loss of data (data integrity, data availability; recall ISO 27001)
- loss of power
- loss of water supply (cooling) / the opposite - flooding
- computer infection outbreaks - viruses, worms
- physical access compromised; intrusion with intent
- asset seizure
- supply chain disruption (software vendor collapse, hardware vendor collapse, ISP collapse)
- loss of Telecom connectivity (data, voice)
- loss of people (people can be SPOFs!)

Changing threats, changing vulnerabilities - shifting scenario: alertness

[Photos: Your IT vendor supply chain?; Burst pipe: DC flooding]

IT Risks - some examples

[Figure panels: Physical DC Risks; Logical DC Risks]

DC outages business impacts - large & immediate

Impact of DC outages:
- down-time
- breached SLAs
- lost revenue
- reputational damage .. clients and prospects lost forever?
- cost to recover
- legal liabilities
- some DCs never recover

Recall: many DCs are in premises that were never designed to be DC-ready

Q: How to protect value-creation capability?
A: BCM

BCM in Data Centres, BCM for Data Centres

360° e-continuity & e-assurance:
- business functions of commercial IT providers (e.g. CEO's office, Procurement, Marketing & Sales, HR, Facilities, Planning, QA)
- IT processes; e.g. Incident/Problem ... Configuration/Change .. Availability/Capacity ..
- Data
- Employees, including IT people
- Physical Facilities: Building, Networks, VoIP
- Computational and Storage: Servers, SAN/NAS
- Software stack: O/S, DBMS, middleware, applications
- Applications/Services: B2B/B2C, e-SCM/ERP/e-CRM, BI/KM, Intranets/Extranets, e-mail

BCM for data centres: 360°

BCM DC initiative output: a BC Plan + organizational capability

A good BC plan is articulated by components and views

With a good BC plan, the CIO/CTO/Data Centre Manager can tell the CEO:
- "We have/are executing plans by DC client/market segment."
- "We have/are executing plans by Application/by Service."
- "We have/are executing plans by DC Department."
- "We have/are executing plans by Building/by Floor .."

BC Plan / BC Planning: (i) What? (ii) How? (iii) Who? (iv) When?

DC Business Continuity Planning: management choices

- Which DC clients/services/applications get restored first, and who will wait longer? recovery sequence for each client?
- Which services/applications get restored completely, which clients will initially only get limited service?

MTPD: maximum tolerable period of disruption; the period of time by which all identified systems/applications/services and data must be restored to normality (for DC, for each client)
RTO: recovery time objective; the period in time within which assets/services must be recovered after disruption (for DC, for each client)
RTO affects the recovery option; shorter RTO -> more difficult, more expensive (for DC, for each client)
RPO: recovery point objective; the point in time to which assets/services must be restored after disruption (for DC, for each client)
RPO affects the volume of data that may need to be restored

BCM initiative requires a dedicated Project/Programme

A BC initiative may adopt Prince 2 or PMBOK project management methods. The typical BCM phases:
- BC Project Initiation Phase (mandate: BCM scope, DC BCM objectives, responsibilities, approach, awareness, and BC project outcomes/deliverables)
- Business Impact Analysis Phase
- Risk Assessment Phase
- BC Planning & Design Phase
- Exercise & Testing Phase
- Maintenance & Review Phase
- Awareness & Training Phase

Business Impact Analysis (BIA) Phase

- BIA achieves understanding of the DC; its users, activities, services/revenue streams, applications, data, liabilities
- BIA takes an internal focus - it identifies critical DC value-creating activities and assets
- BIA quantifies DC loss due to disruption/impairment of assets (reputational, financial, lost revenue, cost of recovery)
- BIA does not estimate the probability of types of DC incidents - it quantifies the consequences
- BIA techniques: questionnaires + interviews

BIA answers the questions: which DC activities & assets create most value?
Of all potential DC losses, which impaired activities & assets would be the greatest loss?
cascading impacts .... multiple DC failure?

The Risk Matrix - likelihood vs. impact

- Major Risks to DC - top-right square; unacceptable; Eliminate, or: Postpone, Transfer, Monitor & Mitigate; + management ownership; + DO BCM - Governance
- Contingency Risks to DC - top-left square; very rare events; will extinguish the business; complacency + overconfidence can distract management from the consequences: DO BCM - Governance
- High incidence and low impact risks, & minor risks; order of the day - operational matters

Risk Assessment Phase

- RA considers the "what/why/where/who" can and would cause the DC the disastrous losses identified in BIA; RA seeks to understand threats to valued service/DC assets
- RA takes a primarily external focus; a "world"/"political"/"economic"/"IT sector" focus - it looks at sources and causes
- RA identifies threat scenarios to DC value-creating activities & assets
- RA estimates likelihood (probability) of occurrence of DC losses & asset impairment identified in BIA (estimates or guesstimates)

Note: Need to locate DC activities & assets in the slots of the Risk Matrix for the DC, and plan to protect value-creating capability

BC Planning & Design Phase - business recovery

DC recovery options: driven by business requirements, regulation, contracts & SLAs.

A typical recovery sequence:
1. Set MTPD objective for the DC/for each client
2. Assign RTOs for individual client/services/applications/data on basis of priority/SLA
3. Evaluate alternative management & technical recovery options
4. Perform cost/benefit analysis for each option (re-plan/iterate)
5. Decide: CEO/CFO; (i) approval (ii) budget (iii) tasking individuals

"Failure to prepare is preparing to fail" - John Wooden

BC Planning & Design Phase - business recovery (2)

Some business recovery options for the DC:
- an Emergency Operations Office/Centre?
- virtual workplace/teleworking from home for selected DC employees?
- alternative IT Call Centre/Data Centre? business decision: On-shore, Near-shore, Offshore
- fallback/redundancy options

A typical recovery sequence:
1. Initial emergency response
2. Resume mission-critical Applications/Services
3. Resume non-critical Applications/Services
4. Restoration to primary site, full services; verify stability

Ongoing communication - DC employees, clients, regulators

BC Planning & Design - technical recovery

DC Recovery options by increasing MTPD / RTO & decreasing cost:
Mirror: full redundancy; short-distance/widely-separated Data Centres; latency considerations
Hot: fully equipped fallback Data Centre
Warm: fallback Data Centre missing key components
Cold: empty fallback Data Centre

Note: business decision: On-shore, Near-shore, Offshore
Note: Mirror, Hot, Warm, Cold: (i) own (ii) shared-space (iii) reciprocal agreements (iv) outsourced; a business issue
Note: high-end BC for DCs: active-active . hot failover .
Note: entry-point BC for DCs: electronic vaulting, remote journaling, transaction logging

Phases: Exercise & Testing; Maintain & Review; Awareness & Training

Exercise & Testing Phase
Give yourself an intentional DC disaster!
1. Re-test BCP regularly (walk-through, checklist, role-playing/simulation, full interruption & rehearsal)
2. Test alternative risk (threat + vulnerability) scenarios

Maintain & Review Phase
feedback from testing/audit, broaden/deepen DC BC plan

Awareness & Training Phase
it's never enough: institutionalize BC practice; embed into DC culture

"The proof of the pudding is in the eating"

An example disaster: mass cyber-attack

A mass cyber-attack may involve:
- prior sniffing
- social engineering, &
- widespread attack

An attack may be blended:
- denial-of-service and/or
- widespread virus/worm infection and/or
- penetration (data theft, data corruption) and/or
- fraud/blackmail

Incident/Problem Management (recall ISO 20000, ITIL)

source: M. Benyoucef, University of Ottawa

An example DC disaster: BCM response to cyber-attack

When prevention fails and an attack is under way, the BCM response is: Interdict, Contain, Recover, Analyse
1. Interdict: actions to block attack: halt its progress
2. Contain: actions to prevent further degradation & regain control
3. Recover: repair; fix damage, bring all systems, data, applications/services to normality
4. Analyse: investigation/audit - examination of evidence

This sequence applies to: DoS/DDoS, infection & penetration cyber-attacks

Note: After the BCM response, seek prevention: (i) Lessons Learnt (ii) improve DC process/systems resilience (iii) heighten DC emergency preparedness

Structure & Content of BS25999

A DC may go beyond organising a BCP project
It may decide to formalise its BCM assurance process by embracing the BS25999 standard

BS25999: holistic structure & content:
- overview of business continuity management (BCM)
- business continuity management policy
- BCM programme management
- understanding the organization
- determining business continuity strategy
- developing and implementing a BCM response
- exercising, maintaining and reviewing BCM arrangements
- embedding BCM in the organization's culture

Some takeaways (1)

BCM thinking & action:
- decrease the number of holes; both management & technical
- decrease the size of the holes, and keep them moving, so they are never aligned

The Swiss cheese model of business resilience

Some takeaways (2)

BCM is a concern of:
- CIOs, CTOs, Data Centre Managers
- Quality Managers / Infosec Managers
- IT Architects, DBAs
- IT Auditors
- CEOs / CFOs, because it's Governance

BCM addresses broad, non-specific risks; it is a catch all
BCM is probably the most cost-effective security control of all
BCM: start now, start small, scale fast, broaden/deepen BCM scope incrementally

"You must be able to respond to your circumstances as they exist - not as you would like them to be" - Brian Billick

Sources in and around BCM for Data Centres

- The Business Continuity Institute (BCI, UK); http://www.thebci.org/
- Guidance/studies on BCM by: ENISA (EU), ECB (EU), CMI (UK), FSA (UK), UK Cabinet Office
- A primer on ITIL / ISO 20000 - IT Service Management; http://www.isaca-malta.org/documents/presentations/IT_Service_Management_Robert_M_Cachia_26_02_2009.pdf
- Agility & alertness: continuitycentral; http://www.continuitycentral.com/uk.htm
- BS25999 (British Standards Institution); http://www.bsigroup.com/en/sectorsandservices/Disciplines/Business-Continuity/
- RiskIT & CobIT frameworks (ISACA); www.isaca-malta.org/
- SFIA HR framework for IT, ISEB (BCS); http://www.bcs.org/
- SEI (USA); http://www.sei.cmu.edu/
- Uptime Institute (USA); http://www.uptimeinstitute.org/

Questions and Discussion

Thank You!!!

Dr Robert M Cachia
http://www.linkedin.com/in/robertmcachia
robert.m.cachia "at" um.edu.mt
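
Supplementary sketch: RTO and RPO check

The RTO and RPO definitions from the management-choices slide can be illustrated with a small Python sketch that checks one recovery, real or exercised, against per-client targets. Everything here is an assumption made for illustration and is not taken from the slides: the class and function names, the client name, the timestamps, and in particular the reading of RPO as a maximum tolerable data-loss window measured back from the moment of disruption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjective:
    """Per-client targets. Field names are illustrative, not from the slides."""
    client: str
    rto: timedelta  # recovery time objective: time allowed to recover after disruption
    rpo: timedelta  # recovery point objective, read here as the tolerable data-loss window


def assess_recovery(objective, disrupted_at, last_backup_at, restored_at):
    """Compare one recovery (actual or exercised) against its RTO and RPO."""
    downtime = restored_at - disrupted_at
    # Data written after the last recoverable point is assumed lost.
    data_loss_window = disrupted_at - last_backup_at
    return {
        "client": objective.client,
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objective.rto,
        "rpo_met": data_loss_window <= objective.rpo,
    }


# Hypothetical exercise data for one client.
objective = RecoveryObjective("client-a", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = assess_recovery(
    objective,
    disrupted_at=datetime(2009, 10, 13, 9, 0),
    last_backup_at=datetime(2009, 10, 13, 8, 50),
    restored_at=datetime(2009, 10, 13, 14, 30),
)
print(result["rto_met"], result["rpo_met"])  # prints: False True
```

In this made-up exercise the service is down for 5 hours 30 minutes against a 4-hour RTO, so the RTO is missed, while the 10-minute gap since the last backup is within the 15-minute RPO. A shorter RTO would typically point towards the hot or mirrored end of the recovery options, at higher cost.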