Ready, Set, Plan

Defeat Disaster: It Is Your Move

Richard Dolewski

Gateway/400, November 08, 2007

Disaster Recovery Planning

Only 70% of today’s businesses have fully documented Disaster Recovery Plans.

Of the companies with plans:

Pre-9/11: 64% NEVER test their plan

Post-9/11: 30% NEVER test their plan

Common Misconceptions

• It will never happen to me!
• Business as usual after a disaster
• We have special requirements
• Too many other priorities

Murphy’s law: Disaster strikes when, where and because you are not prepared

Common Issues

We all tend to let our guard down when times improve

As Planners we must always be ready & be prepared.

None of us is safe from weather-related disasters!

Impact on Cost of Downtime

Tangible Costs

• Lost Revenue
• Lost Wages
• Lost Inventory
• Regulatory Violations
• Legal Fees

Intangible Costs

• Lost Opportunity
• Employee Retention
• Goodwill
• Brand Damage
• Customer Respect

Average Cost Per Hour of Downtime —By Industry

Finance: Brokerage Operations $3.15 Million

Finance: Credit Card Auth. $2.1 Million

Online Retail: $113,000

Communications: Internet Provider $90,000

Transportation: $89,500

Media: Ticket Sales $69,000

Transportation: Package Shipping $28,000

Source: Contingency Planning Research 2005
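As a rough illustration of how the hourly figures above translate into the cost of a specific outage, the sketch below multiplies an hourly rate by an outage length. The function name `outage_cost` and the 4-hour outage are assumptions for illustration only; the rates are copied from the list above, and a real estimate would use your own organization's numbers.

```python
# Hourly downtime costs in USD, copied from the industry list above.
HOURLY_COST = {
    "Finance: Brokerage Operations": 3_150_000,
    "Finance: Credit Card Auth.": 2_100_000,
    "Online Retail": 113_000,
    "Communications: Internet Provider": 90_000,
    "Transportation": 89_500,
    "Media: Ticket Sales": 69_000,
    "Transportation: Package Shipping": 28_000,
}


def outage_cost(industry: str, hours: float) -> float:
    """Estimate outage cost as hourly rate times outage length (hypothetical helper)."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return HOURLY_COST[industry] * hours


if __name__ == "__main__":
    # Assumed example: a 4-hour outage in each industry.
    for industry in HOURLY_COST:
        print(f"{industry}: ${outage_cost(industry, 4):,.0f}")
```

This treats cost as linear in time, which is a simplification; intangible costs such as brand damage are not captured.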

No Disaster Recovery Plan

Guarantees:

• Confusion

• Lack of direction

• Conflict

• Lost Customers

Definition of a Disaster

A sudden, unplanned event that causes great damage and loss to an organization.

The time factor determines whether an interruption in service is an inconvenience or a disaster. The time factor varies from organization to organization.

What is Disaster Recovery

Reaction to a sudden, unplanned event that enables an organization to continue critical business functions until normal business operations resume.

“…It is not enough to arrange for hardware replacement;… planning must address continuation of business operations, or business continuation.”

Consider the business impact of downtime!

Why is this area “vital”?

Expectations of the Services are demanding

Technology is an enabler of business

Penalties are becoming more severe

Business is becoming more competitive

Can serve as both a source of competitive advantage as well as competitive disadvantage

Vulnerability IT Assessment

Key Steps to DR Planning

IT Capabilities Assessment

– Overview our current IT capabilities
– Align the Business Needs
– List the gaps between the Business needs and current solutions
– What solutions are needed to bridge the gap

Vulnerability Methodology – BP Audit

Objective is to drive down the duration of outages

A systematic approach towards:

Reducing the frequency of outages by eliminating all single points of failure.

Reducing the duration of outages by configuring both hardware and software for the fastest possible recovery.

Vulnerability Methodology

Best Practices Audit

Analyze

Identify Potential Exposures

Provide Alternatives & Solutions

Implement Solutions

Power Redundancy

Physical Security

• Open Door Policy
• Entry points – Door Management
• Cipher locks
• IP Cameras

Save/Restore Strategy

• System saves must be reviewed
• Ensure complete recovery is possible from a mid-week, mid-day or weekend failure
• Electronic notification of exceptions
• Review Restoration Procedures
• Backup software: BRMS

Save/Restore Strategy

Partial saves are used because of shrinking backup window.

Save While Active may be the solution for you.

Introduce faster tape technology

Less than 50% of companies have complete backups

Reliable Backups

Backups are the backbone to any recovery situation

– In most recovery situations, the backups are not adequate

– Excessive time is spent recreating parts of operating system

– System State is typically not complete

– QUSRSYS

Design & Test Your Backups

Testing your recovery strategy ensures you have a good backup strategy!

– Your backup is only as good as your recovery
– Your recovery is only as good as your backups

Hint:

Design recovery strategy before your backup strategy

Checklist for Backup & Recovery

Examine current save strategy for all mission critical servers.

Map out how you would rebuild multiple servers. Is there a specific order required? Consider enterprise recovery.

Check the backup logs. Missing objects, folders, directories.

Examine backup software logging: Veritas, ARCserve, BRMS, TSM

Tape Management Software

Strategic Backup Management Product

• Manages your media
• Automates your Backups
• Simplifies your Restores and Recoveries
• Provides Detailed Reporting

...and more

Business Impact Analysis

Mission Statement

IT Services Mandate:

To protect systems from risk.

To ensure continuing confidence

To monitor and protect corporate computing assets.

Protect Important Assets

Four Primary Assets needed to operate Information Systems:

Hardware and Networks can be replaced

Facilities can be rebuilt or relocated

Data is Priceless !!!

Business & IT DR Planning

– Defining Business Objectives
– Prioritizing Business Objectives
– Overview your current IT capabilities
– Alignment between IT and the Business
– Minimize the gap between Business needs & IT deliverables
– What solutions are needed to bridge the gap
– Acceptable length of downtime - High Level

3 Steps to Business Preparedness

1) PLAN to stay in business

2) TALK to your Business & IT Folks

3) PROTECT your investment

Key Steps to DR Planning

Business Impact Costs

– Create cost estimates for each agreed upon risk scenario
– Define acceptable amount of downtime
– Define acceptable amount of data loss
– Create budget costs to implement agreed upon solution

Business Impact Analysis

A summary of critical IT applications.

- Application name
- Application priority
- Special Requirements
- Maximum outage (hours, days)

These applications are to be included in the BIA presentation and report.
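One way to hold the summary described above is a small record per application, sorted into a recovery sequence. The sketch below is only illustrative: the class name, the convention that priority 1 is most critical, the tie-break on shortest maximum outage, and the sample applications are all assumptions, not part of the original material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalApplication:
    """One row of the BIA application summary (field names are assumed)."""
    name: str
    priority: int                # assumed convention: 1 = most critical
    special_requirements: str
    max_outage_hours: float      # maximum tolerable outage, in hours


def recovery_sequence(apps):
    """Order applications by priority, then by shortest tolerable outage."""
    return sorted(apps, key=lambda a: (a.priority, a.max_outage_hours))


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        CriticalApplication("Payroll", 2, "none", 72),
        CriticalApplication("Order Entry", 1, "EDI link", 4),
        CriticalApplication("Warehouse Management", 1, "none", 24),
    ]
    for app in recovery_sequence(sample):
        print(app.priority, app.name, f"{app.max_outage_hours}h")
```

The sorted list gives a first draft of the recovery sequence that senior management can then confirm or adjust.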

Business Impact Analysis

• Define cost of outage
• Total revenue
• Customer base
• Fines and penalties

• Graphic representation of revenue loss
• Define recovery sequence of the vital processes
• Review loss per hour, day, week and/or month
• Obtain senior management confirmation

How much money would your company lose if a major outage occurred?

Questions for the Business

Include: Manufacturing, Finance, Purchasing, Sales, Warehousing.

What services do you provide them?

What services do they provide?

Risk Analysis

Key Steps to DR Planning

Identify Risks

– Identify scenarios where a recovery is required
– Identify key business requirements necessary during these potential interruptions
– Incident or Disaster

Objectives of a Risk Analysis

Answer: Four basic questions. . .

1. What could go wrong? Threat/Event
2. How often can it happen? Frequency
3. What will be the consequences? Impact
4. How certain are the answers above? Confidence

Statement of Risk

Quantitative

• Assigning values, such as $$$$, to something
• Identifying the cost of a particular effect, incident or phenomenon
• ALE - Annualized Loss Exposure
• Objective

Annual Loss Exposure

RISK = FREQUENCY times EXPOSURE (R = f * e)

Where: f = FREQUENCY, e = EXPOSURE

EXAMPLE: POWER FAILURE

Frequency = 5 times a year

Result = Uncontrolled loss of $70,000 (Dept A)

= Uncontrolled loss of $10,000 (Dept B)

5 x $70,000 = $350,000

5 x $10,000 = $50,000

Produces a Total ALE = $400,000
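The R = f * e calculation can be checked with a few lines of code. The sketch below is a minimal illustration, with the class and function names invented for the purpose. With the inputs as stated (five power failures a year, $70,000 per event for Dept A and $10,000 per event for Dept B), the per-department products are $350,000 and $50,000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    """Loss per event for one department (hypothetical structure)."""
    department: str
    loss_per_event: float  # e = EXPOSURE, in dollars


def annualized_loss_exposure(frequency_per_year: float, exposures) -> float:
    """ALE = sum over departments of f * e."""
    return sum(frequency_per_year * e.loss_per_event for e in exposures)


# Inputs taken from the power failure example above.
POWER_FAILURE = [
    Exposure("Dept A", 70_000),
    Exposure("Dept B", 10_000),
]

if __name__ == "__main__":
    total = annualized_loss_exposure(5, POWER_FAILURE)
    print(f"Total ALE = ${total:,.0f}")
```

Keeping frequency and per-event exposure as separate inputs makes it easy to rerun the estimate when either assumption changes.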

Recovery Options

Key Steps to DR Planning

Where will you go

Determine your recovery site
• Internal alternative
• Commercial Hotsite
• Hosted High Availability (Internal or Commercial)
• Location
• Geographic Separation

The Disaster Recovery Challenge

Resume Time Sensitive Business Operations with NO warning and:

• At another (remote?) location/facility
• Smaller server with less capacity & capability
• Using only information stored off site
• Within a designated recovery time objective
• Without some key personnel

Recovery Time Objective

The time within which Business Processes must be Restored at acceptable Levels of Operational Capability to Minimize the Impact of an outage.

[Figure: RTO timeline along a Time axis, with markers for Point of Disruption, Resumption of Operations (Business or Data Processing), Time-Sensitive Systems Operational with Current & Accurate Data, and Business Processes Functional.]

Recovery Time Objective

Recovery Task                                             Time to Complete (hours)
Assess the disaster situation                             3 hours
Declare a disaster                                        2 hours
Retrieve tapes from our offsite supplier                  1 hour
Transport key staff & backup tapes to the recovery site   2 hours***
Restore all Mission Critical Servers                      20 hours
Configure and redirect networks                           1 hour
Apply incremental data                                    2 hours
Testing & validation                                      1 hour
Total Time                                                32 hours
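The task timings above can be totalled and compared against a recovery time objective. The sketch below is a rough illustration that assumes the tasks run strictly one after another. The 48-hour target is an assumption, based on reading the "within 2 days" wording in the plan's mission statement as 48 hours. The function names are hypothetical.

```python
# (task, hours) pairs copied from the recovery task list above.
RECOVERY_TASKS = [
    ("Assess the disaster situation", 3),
    ("Declare a disaster", 2),
    ("Retrieve tapes from our offsite supplier", 1),
    ("Transport key staff & backup tapes to the recovery site", 2),
    ("Restore all Mission Critical Servers", 20),
    ("Configure and redirect networks", 1),
    ("Apply incremental data", 2),
    ("Testing & validation", 1),
]


def total_recovery_hours(tasks) -> float:
    """Sum task durations, assuming tasks are performed sequentially."""
    return sum(hours for _, hours in tasks)


def meets_rto(tasks, rto_hours: float) -> bool:
    """True when the sequential total fits inside the RTO."""
    return total_recovery_hours(tasks) <= rto_hours


if __name__ == "__main__":
    total = total_recovery_hours(RECOVERY_TASKS)
    print(f"Total recovery time: {total} hours")
    print("Meets 48-hour RTO (assumed target):", meets_rto(RECOVERY_TASKS, 48))
```

Recording the actual duration of each task during a test and feeding it into the same calculation shows how much margin the plan really has.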

Disaster Recovery Hotsite

High Availability

Disk Protection

Transaction Integrity

Tape Backup

Backup & Recovery Hierarchy

Risk Management

Data Resiliency Level

Vendor Exercise

BULL**** METER

Evaluating Hotsite Vendors

Location, Location, Location
• Proximity to public transportation
• Separate Power Grid & CO

Facility
• Appropriate computing hardware
• Compatible communication networks
• Adequate workspace

Support Service

Evaluating Hotsite Vendors

• Support Staff & Availability
• Experience
• Test Time
• Number of Customers
• Cost
• Additional services
• Declaration fees
• What’s included and what’s not?

Recovery Solution

Hotsite vs. High Availability

Depends on:
• Recovery Time & Recovery Point
• Business Objectives

Downtime vs. Availability

Downtime Cost Variable is $/Hour

• Understanding downtime $/Hour is the most important key to understanding your availability requirements.
• Labor costs, lost productivity & revenue
• The cost of downtime continues to rise.
• The cost of computing is falling.
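To connect an availability percentage to the $/Hour figure, the sketch below converts availability into expected hours of downtime per year and then into an annual cost. It assumes a 365-day year of round-the-clock service. The function names and the sample inputs (99.9% availability at $90,000 per hour) are illustrative assumptions.

```python
HOURS_PER_YEAR = 24 * 365  # assumes 24x7 service and a 365-day year


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year at a given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Annual downtime cost: downtime hours times the $/Hour variable."""
    return annual_downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    # Assumed example inputs.
    for pct in (99.0, 99.9, 99.99):
        hours = annual_downtime_hours(pct)
        cost = annual_downtime_cost(pct, 90_000)
        print(f"{pct}% available: {hours:.2f} h/year, about ${cost:,.0f}")
```

Comparing the annual cost at two availability levels gives a ceiling on what it is worth spending to move from one to the other.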

Building a Team Means Getting the Right People

Perspective

The difference between a GREAT recovery team and the one that falls down on the job is the:

Caliber of the Team members !

Heroes Step Forward

Perspective

Too often companies populate their DR teams with raw, inexperienced staffers

Volunteers chosen to satisfy an auditor or, worse, the sacrificial lamb

The Right People

Not only are these folks leaders, and the most capable:

They are trusted

Confident

Able to correct mistakes

Dedicated to the success of the team

Characteristics of a Good Team

Ideal Characteristics:

• Considered an expert by his/her peers
• A go-to person for anything and/or everything
• Works well under pressure
• Controls emotions
• Confident
• Trusted by peers
• Dedicated: a company person
• Willing to fix problems created by others

Characteristics to Avoid:

• Hands-off individual (avoids work)
• New to the organization
• Totally unfamiliar with the systems
• Folds under pressure
• Hot head
• Lacks sense of urgency
• Tendency to blame others
• Excuses, excuses, excuses
• Pure 9-to-5er
• First one out the door
• Nowhere to be found

Roles & Responsibilities

Key Steps to DR Planning

Roles

– Educate staff on their roles in the DR plan
– Clearly state expectations in a disaster situation
– Who is in Charge
– Pre-define methods you will utilize to contact staff

Types of Teams

IT Management Team

Executive Management Team

Damage Assessment Team

Media Relations Team

Recovery Management Team

Technical Recovery Team

Platform Recovery Team:
• iSeries Recovery
• pSeries Recovery
• Intel Server Recovery
• Unix/Linux Recovery
• Network Recovery
• Applications Team
• Security

Insurance Recovery Team

Site Restoration

Facilities Build Team

The Role of Executives

Executives are not typically involved and should NOT be. Why:

Executive Pressure is hindering

Intimidating

Often Nasty

Glorified Experts

The team must provide regular status reports & the Executives should be accessible if required.

Team Building

IT Recovery Team:

• Initial Assessment
• Facilities Recovery and Restoration (Hardware Specific?)
• Communications Recovery Teams (voice & data)
• Data Processing Functional Recovery
• Vital Records (Off site storage)

IT Recovery Team Role

Recovery includes iSeries/400, and all mission critical Intel Servers, Applications, and Communications.

Responsible for initiating damage assessment, recovery actions, and notification procedures until such time as one of the Senior Executives is available.

All reporting will flow to the IT Management Team.

IT Team Leader Role

This individual needs to be technically skilled

Have a strong background in all the server hardware, software and complete IT infrastructure.

Communicate with vendor technical reps and hardware engineers on performance issues and hardware problem resolution, and interface with the management team.

Be able to schedule and manage people.

Responsible for initiating damage assessment activities, recovery actions, notification procedures.

Disaster Recovery Teams

DISASTER RECOVERY MANAGEMENT TEAM   Pri: ________________ Alt:

ERP APPLICATION RECOVERY TEAM   Pri:

JDE APPLICATION RECOVERY TEAM   Pri: __________________ Alt:

HELP DESK TEAM   Pri: __________________ Alt:

Administration - HUMAN RESOURCES - CLAIMS - ADMINISTRATION - INSURANCE - REGULATORY

Unix   Pri:______ Alt:______

AS/400   Pri:______ Alt:______

AIX   Pri:______ Alt:________

Network LAN   Pri:______ Alt:_______

Selecting Plan Manager

Designate a DR Plan Manager - DR Coordinator to manage the DR initiative.

The DR Plan Manager acts as a focal point for the project.

Organize, plan, and facilitate the development of DR plan based on the prioritization from the Business.

Comply with standards and utilize the methodology for recovery plan development, maintenance and testing.

Plan Manager Activities

Provide DR Plan maintenance activities.

Air Travel Arrangements and Hotel

Tape Media Arrangements / Air or Ground Cargo

Assist in detailed damage assessment and insurance.

Coordinate with HR to provide counseling for staff or family.

Coordinate food and sleeping arrangements

Coordinate testing Activities

The role of the Plan Manager during a Technical test is to:

• Manage the conduct of the test.
• Develop Table Top Scenarios.
• Ensure that each objective is fully realized.
• Ensure that each test participant follows the procedures.
• Record problems and their resolutions as they arise.
• Record the duration of each of the procedures.
• Liaise with the Hot Site staff.

Plan Manager Activities

The Plan Manager is also responsible for writing the summary report for the test.

• Review the objectives of the technical test.
• Summarize the changes for the DR Plan and distribute.
• Summarize any recommendations resulting from the test.
• Post Test meeting with Participants.
• State the schedule for the next test.

How Disasters Affect Staff & the Caring

Regional Disaster Elements

Recovery is only possible if someone is available to put IT back together again.

Equipment may be accessible, but your recovery will be ineffective if your IT staff cannot access the recovery site.

Key Personnel are often displaced or unavailable during a major regional disaster

Disasters affect people in unpredictable ways

• It can devastate people
• Can affect others around them
• Make the individual unable to function
• Emotional breakdown

Respect the situation

The effects are real

Best Practices During a Disaster

• Alleviate recovery team workloads with support staff
• Organize communications so that team members have only one person to report to
• Eliminate one-on-one updates
• Let your staff do what they do best
  – Executives make strategic decisions
  – Managers coordinate staff & resources
  – Technicians fix the problems
• Ensure Team Leaders are sensitive to team members' personal needs
• Ensure team members' families are cared for
• Locate missing team members and notify others of their whereabouts
• Provide support services to the families of injured staff
• Broadcast all positive accomplishments

Best Practices During a Disaster

• Feed your staff - Have snacks handy
• Stay away from High Energy Drinks
• Enforce a maximum shift duration of 12 hours
• Encourage people to take breaks

Provide distractions during breaks

Best Practices During a Disaster

Family Comes First

Until the basic personal needs are met

Family comes first. Always !!

Staff members will not focus on the Enterprise Recovery

Staff members may or will not be available

Family Comes First

Recovery Plans must provide for family needs as well as staff members. Offer the basic needs.

Ensure your organization demonstrates that it cares about the Recovery members and their families

• Offer temporary Shelter and food
• Medical care
• Housing
• Day Care
• Transportation

Home Disaster Plan

EMERGENCY SUPPLY KIT

• Supplies to be included in any emergency kit:

– Water
– Food
– Battery-powered radio and extra batteries
– Flashlight and extra batteries
– First Aid kit
– Whistle to signal for help
– Dust or filter masks
– Moist towelettes for sanitation
– Wrench or pliers to turn off utilities
– Can opener for food (if kit contains canned food)
– Plastic sheeting and duct tape to "seal the room"
– Garbage bags and plastic ties for personal sanitation

Disaster Recovery Planning

Recovery Plan Format

What Should you use to write the Plan???

Is trustworthy word processing software enough?
• Most Convenient
• No training required

DRP Software Planning Tools
• How big is too big?

Planning Methodology

•Identify exposures

•Provide alternatives

•Define recovery strategy

•Develop solutions

•Document

Customer Environment Analysis

Business Impact Analysis

Analyze/Validate

The purpose is to ensure IT can provide services required to meet the business objectives in the event of a disaster

Business Analysis: Data Gathering

IT Analysis: Data Gathering

Plan, Test, Maintenance, Relocation

Cannot be approached casually

The Plan must be ....

– Well organized
– Action Oriented
– Comprehensive

Objective: Total restoration of Services in a timely manner

Disaster Recovery Planning

Develop and Implement the DRP

Disaster Recovery Planning Design Concerns

• Minimize dependency on specific individuals
• Ensure completeness
• Ensure establishment of critical decisions
• Minimize dependency on specific outside entities
• Ensure the plan is current - Living Document
• Signoff

Information Gathering Sample

1) Hardware configurations of all servers.

2) Software running on all equipment.

3) On each system detail, IP address, system name.

4) Backup procedures, rotations, tape naming convention

5) Vendor list supporting equipment.

6) Restoration procedures for system(s) if different.

7) Location of all required software to recover server(s).

8) Supporting network hardware.

Plan Elements

Plan Contents:
• Mission statement
• Disaster definition
• Team responsibilities
• Contact information
• Critical documentation
• Unique procedures
• Recovery site inventory
• Backup/recovery process
• Implementation plan
• Test plan
• Maintenance
• Relocation/migration plan

Mission Statement

The Disaster Recovery Plan has been developed to recover critical computer based applications and services within 2 days of a full scale disaster.

Carrier Services

Relocate Operations

Hotsite

WAN

Plan Elements

Executive Summary and Table of Contents

Scope & Assumptions & Definitions

Emergency and Notification Procedures

Disaster Declaration

Recovery & Resumption Site Procedures

Voice and Data Communications Requirements

Scope of the DR Plan

The DR Plan will provide immediate response & subsequent recovery from any unplanned computing service interruption, such as a critical server failure, or a catastrophic event such as a loss of facility.

Provide an organized and consolidated approach to managing response and recovery activities.

Recover essential operations in a timely manner.

DR Plan Assumptions

Only the primary site has been disabled by the disruption, all other facilities are unaffected.

The Off-site storage location for critical backup files is accessible.

Qualified personnel as identified in this document are available to perform Disaster Recovery responsibilities.

Plan Elements

• People assignments, responsibilities & training
• Site: Selection and environment preparation
• Vital records: Inventory and Backup
• Software Systems: Inventory and Backup
• Application Systems: Inventory and Backup

OS/400 Recovery Sample

• Licensed Internal Code Restore
• Licensed Internal Code Restore at Hotsite
• Building your Disk Configuration at the Hotsite
• Licensed Internal Code Restore at Data Center
• Building your Disk Configuration at the Data Center
• Restoring the Operating System
• Recovering the BRMS Product
• Initialize BRMS Device
• User profiles
• Devices
• NONSYS
• IBM LICPGM

Plan Elements

• Hardware Inventory, Agreements, Documentation
• Communications, Current, Backup
• Transportation: Emergency Requirements
• Supplies: Critical items - Vendors
• Documentation: Inventory & Off-site Backup
• Other Equipment
• Vendor Contracts, Etc.
• Test Plans

Plan Elements

Appendix

Vendor Agreements & Contracts

Telephone calling tree

Team notification procedures

Recovery and resumption sites, addresses, telephone numbers

Call Tree Information

• Name
• Title
• Address (Street address, not post office box number)
• Office telephone number
• Home telephone number
• Pager number, if available
• Cellular telephone number, if available
• Personal Email
• Alternate telephone number
• Blackberry or PDA
• FAX

• Determine Personnel Status
• Plan Activation Procedures
• First Alert Response
• Disaster verification
• Placing Hotsite on Alert
• Activate Damage Assessment team
• Command Center
• Hot-Site Call up Procedures
• Team Responsibilities During a Disaster

Plan Activation

• Declaring a Disaster
• Disaster Declaration Personnel
• Directions to primary Hotsite
• Directions to Alternate Hotsite
• Travel Information
• Recalling Offsite Tape
• Hot-Site Opening
• High Availability Role – Swap

Hot-Site Activation

Conclusion

•Identify exposures

•Provide alternatives

•Define recovery strategy

•Develop solutions

•Document

Customer Environment Analysis

Business Impact Analysis

Analyze/Validate

The purpose is to ensure I/S can provide services required to meet the business objectives in the event of a disaster

Business Analysis: Data Gathering

I/S Analysis: Data Gathering

Plan, Test, Maintenance, Relocation

Testing

Key Steps to DR Planning

Test Your Plan
– Test your stated recovery scenarios
– Test your restoration capabilities
– Train your staff for response
– Validate all assumptions
– Timeline validation
– Document required changes to your solutions and business needs

Implementation Review

Review Documentation:
• Organization
• Consistency
• Is it Clear?
• Staffing
• Lack of Documentation (Missing)

Validation Exercise

1. Does the plan meet the Recovery Time Objectives?

2. What is your RPO?

3. Has anything changed?

Passive Testing - Format

Participants bring their copy of the DR plan

Plan Manager reviews objectives of the test

Plan Manager starts the test

The Recovery team discusses the scenario

Recovery team executes the Recovery tasks

Key Objectives of a Passive Test

Validate Completeness and Accuracy of the Plan

– Team Organization– Call lists– Checklist for all team members– Contacts - Internal & External– Recovery Resources

Passive Testing

The exercise should:

• State the objectives of the walk through
• List the participants
• Select a scenario relative to your company
• Include the scenario in definition handouts

Summarize the changes for the Computer Contingency Plan and schedule for their completion

Passive Testing

Reduce the team !!!

• Reduce the Recovery team by 20%.
• Examine Logistics
• Put the logistics into action
• Can we make it happen as written???
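One way to apply the 20% reduction without bias is to pick the "unavailable" members at random just before the exercise. The sketch below is only a suggestion: the function name, the rounding rule, the seed parameter and the sample role names are assumptions, and it expects each member name to be unique.

```python
import random


def reduce_team(members, fraction=0.20, seed=None):
    """Randomly mark a fraction of the team as unavailable for the exercise.

    Returns (remaining, removed). At least one member is always removed.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable for review
    n_removed = max(1, round(len(members) * fraction))
    removed = rng.sample(list(members), n_removed)
    remaining = [m for m in members if m not in removed]
    return remaining, removed


if __name__ == "__main__":
    # Hypothetical team roster for illustration only.
    team = [f"Team member {i}" for i in range(1, 11)]
    remaining, removed = reduce_team(team, seed=2007)
    print("Unavailable for this test:", removed)
    print("Must execute the plan:", remaining)
```

Running the walk-through with only the remaining members shows whether the plan depends on specific individuals.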

Telephoning vendors after normal business hours to ensure that their hotline and service numbers are correct and manned

Simulation Exercises

Active Testing

A hands-on exercise focused on determining how well the plan works.

Tests the Business on how impacted they are in a disaster.

Simulate Reality

Do not endanger your primary source of revenue.

Testing must not affect normal day-to-day business.

Active Testing

It’s not all bad news when your plan fails during a test…

…frequent testing identifies gaps in your recovery plan

– Hot-site (Alternate Site)
– Actual Reload (Full scale, single system)
– Communications switch
– Assessment of Backup & Recovery procedures

Active Testing

Introducing Murphy

Does this sound familiar ?

• Testing only done from full backups
• Special backups performed to ensure success
• Backup tapes pre-shipped
• Same staff perform recovery steps – Alternates back running the shop
• Never test the whole thing – Communications, All servers in Enterprise

Tips

• Test because your Business Depends on it.
• Pre-Arrange for serial dependent software Keys
• Ensure software & PTFs are current
• Test a Mid-Week scenario
• Install latest Backup/Recovery group PTFs
• Keep your hot-site aware of your hardware profile
• Know where your LIC CD is located
• Have you performed a FULL save since the last upgrade?

Successful Testing Requires Teamwork

“The desire to recover is High. The time to test frequently is not there. Testing is a partnership and ONLY successful when everyone works together.”

Thank You

Questions

Richard Dolewski, CDRP

VP Business Continuity

rdolewski@wts.com

Tel. 1-206-436-3321

www.wts.com