eBook a Guide to Modern IT Disaster Recovery

8/3/2019 eBook a Guide to Modern IT Disaster Recovery


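INSIDE:

Introduction: What are Black Swans, and what do they have to do with your data center?

Chapter 1: Expect the Unexpected

Chapter 2: Start with an Intelligent and Virtual Foundation

Chapter 3

Chapter 4

Conclusion

Appendix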

www.vmware.com

How to prepare for and mitigate the Black Swan Events in your data center

A Guide to Modern IT Disaster Recovery



INTRODUCTION

Your data center is your castle. That's where all your critical IT components (hardware, data, and software) reside. You protect it with the latest bulletproof security solutions and make it reliable through redundant multiprocessing, highly scalable platforms, and super-fast optical networks.

And yet, it is not fully protected from forces beyond your control such as natural disasters; man-made events like road closures; security procedures or partner service interruption at a specific site.

Downtime and loss of data, even if temporary, can have long-lasting effects for business and can contribute to the demise of an otherwise well-oiled business:

• Loss of revenue from your customers' inability to do business with you

• Diminished market credibility and customer trust, resulting in churn

• Penalties for violated SLAs with partners, suppliers, distributors, and franchisees

• Costs of recovering and repairing the lost data

• Legal costs of meeting internal and external compliance requirements

How do yo

vestment e

investment

q 43% o

reope

q 93% o

days

q 40%

disas

acces

CIOs and I

which norm

adapt prac

deal with p

actions as w

Gartners Top

These sta

within your

1 McGladrey and2 National Arch3 Gartner, Dece

Expect the Unexpected

Our hope is you never need to activate an IT disaster reco

Our job is to provide automated protection if you do nee

CHAPTER 1

You know VMware as the virtualization company that has been the market leader for the past 11 years. In fact, according to Gartner, more than 80% of all virtualized applications in the world run on VMware today. This ebook highlights the VMware perspective on disaster recovery in the data center. But let's forget about IT for a moment.

The Theory of Black Swan Events is a metaphor that encapsulates the concept of surprise events that have a major impact. It refers to unexpected events of large magnitude and consequence and their dominant role in history. Such events, considered extreme outliers, play vastly larger roles than regular occurrences.

The Black Swan, a book by Nassim Nicholas Taleb, explains that while Black Swan Events are unpredictable, a person or organization can plan for negative events, and by doing so strengthen their ability to respond, as well as exploit positive events. Taleb contends that people in general and specifically enterprises are very vulnerable to hazardous Black Swan Events and are exposed to high losses if unprepared.

There is an obvious parallel between the Theory of Black Swan Events and the need for disaster preparedness for your critical IT assets. Deploying automated disaster recovery (DR) is the way to protect IT and business from unpredictable events, even Black Swan Events. The following chapters explain the basics of DR and the required infrastructure. They also offer DR hidden realities and best practices with real-world advice.

What are Black Swans, and what do they have to do with your data center?


DR is the IT industry's way to prepare and f



Until reliable virtualization management solutions became available several years ago, DR solutions fell well short of satisfying business requirements due to the following:

• High Cost

• Complexity

• Lack of Reliability

With traditional manual DR solutions, the high cost came from the need to deploy a second failover site with dedicated infrastructure, software licenses, and human personnel. The complexity was high because, to ensure the recovery of entire business services, the recovery plans had to manipulate many individual components and moving parts: applications, hosts, network, and storage. The lack of reliability of these procedures stemmed from low automation and the inability to test any recovery procedure.

Many organizations had limited confidence in meeting their Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in the event of a disaster. IT departments were hesitant to expand disaster protection, uncertain whether the quality of the insurance was really worth its cost.

Virtualization is fundamental and critical to the success of DR planning. Virtualization abstracts the complexity of hardware and software and allows standardization of processes, thus making planning and automation of the recovery procedures much more reliable and repeatable. In fact, in a recent IDG survey, 70% of customers achieved improved BC/DR with virtualization.1

An intelligent virtual infrastructure based on VMware is the right foundation for the modern DR solution. Highly adaptive and scalable, it is optimized for business-critical workloads with built-in intelligence.

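The VMware DR solution provides:

• The simplest way to replicate applications to a secondary site

• The simplest way to set up recovery and migration plans

• Fully automated, most reliable site recovery and migration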

Start with an Intelligent and Virtual Foundation

Reliable, Repeatable, Recoverable

CHAPTER 2

[Figure: Virtual Recovery Process (4 hours): Restore VM, Power on VM. Physical Recovery Process (40 hours): Configure Hardware, Install OS, Configure OS, Install Backup, Start "Single-Step Automatic Recovery".]

Cost-efficient DR: With the rapid adoption of virtualization and the evolution of replication technology, DR is becoming more cost-efficient. Virtualization enables infrastructure consolidation at the failover site. Less costly replication options are more broadly available, using lower-end storage appliances or stand-alone software solutions. With these advances, DR can protect large-scale mission-critical IT assets, as well as smaller sites and Tier 2 applications.

Automated DR: In virtual environments, end users are shielded from the complexity of managing each step in the recovery process. Now, a DR solution can automatically execute and coordinate all the steps required to ensure the desired level of protection. Traditional runbooks are no longer good enough to manage recovery plans and are replaced with software-driven recovery

plans. Setti

ment is as s

business se

rib si

tion, organ

that they c

provides th

a non-disruare now rep

ing the ris

predictable

The chart

virtualize

along wit


[Chart: "How would you describe your organization's usage of the [...] capabilities with its production environment-based virtual m[...]" (data labels shown: 39%, 35%)

Capabilities surveyed:

• Automated restart of virtual machines in the event of a physical server hardware failure

• Backup and recovery solutions integrated with the virtualization platform

• Site recovery solutions for virtual machines

• Live migration of virtual machines based on CPU, memory, and network utilization policies

• Live migration of virtual machines

• Live migration of storage associated with virtual machines

• Automated deployment of virtualized servers based on CPU, memory, and network utilization policies

• Automated enforcement of lifecycle policies and reclamation of resources from expired virtual machines

• Automated deployment of virtual machines based on power consumption policies

Response options: We currently use this feature/capability; We plan to [...]; We have no plans to use this feature/capability; Don't know.]

1 IDG Research, Benefits of Virtualizing Business Critical Applications, March 2011

Source: ESG white paper: Enterprise Strategy Group, 2011: Virtualizatio
http://www.vmware.com/solutions/cloud-solutions.html
http://www.vmware.com/products/site-recovery-manager/overview.html



MYTH 3: a

y w

REAlITY

out testing

be tested w

validity. SR

recovery p

How its do

Adventist H

zation in th

roughly ou

Services (A

employs m

To ensure A

Zero initia

service and

systems li

record app

Adding SR

IS to stream

DR plannin

ing and tes

With SRM, setting up an automated recovery plan is effortless and can be done in a matter of minutes, instead of the weeks required to set up manual runbooks.

How it's done at Swedbank

Swedbank is one of the largest financial institutions in Scandinavia and the Baltic, with 362 branches in Sweden and 222 branches in Estonia, Latvia, and Lithuania. The bank serves 9.5 million private customers and 534,000 corporate customers, with 18,000 employees. Preventing service disruption is critical to Swedbank.

Swedbank had to meet recovery targets for its legacy applications through traditional means of backup and recovery, which were complex and time-consuming. Swedbank deployed SRM to simplify and automate the recovery process, management, and testing of recovery plans. Since the SRM implementation, Swedbank tests its DR capabilities at least twice a year. It shuts down one data center completely, moving workloads to the surviving data center. It runs everything in the backup data center for 24 hours, and then fails over again to the original data center.

Mart Nael, head of Core Infrastructure, Group IT at Swedbank, states, "Our recovery time is below 30 minutes for mission-critical workloads and less than four hours for the whole data center."

Business results:

• Positive ROI within one year from hardware-cost avoidance

• IT operational costs reduced 14% year-to-year

• 1,000 virtual machines managed by two full-time-equivalent staffers

• 30x faster server provisioning


MYTH 1: diss ry is y ; is

si s smig.

REALITY: VMware vCenter Site Recovery Manager (SRM) gives you the flexibility to define failover scenarios that meet your choice of coverage, speed, and cost of recovery. For example, while a dedicated recovery site is a robust solution (and yes, more expensive), in many cases it is sufficient to have an active bidirectional approach where two or more data centers are complementary with enough capacity to pick up critical applications. Therefore, no resources are wasted and business continuity is maintained.

Overall, SRM customers consistently report significant savings of money, resources, and time.

How it's done at Challenger Limited

Challenger Limited issues annuities and provides investment products and services. The organization runs two co-located data centers, supporting around 500 staff in Australia.

In order to meet business requirements for fast recovery and minimal data loss, Challenger Limited implemented a dual-cluster VMware infrastructure that was linked to networked storage devices in its two co-located data centers, at about one-third the cost of a physical disaster recovery environment. SRM has enabled the organization to dispense with most of the 50 tapes previously used to back up data, saving one person-day per week. In addition, Challenger Limited automated hundreds of steps in its disaster recovery processes.

Business results:

• Improved RPO from 24 hours to 90 minutes and recovery time from 24 hours to less than four hours

• Reduced the number of people needed to restore systems to one person

• Cut capital investment for disaster recovery to a third of the cost of a physical environment

• Eliminated the need to acquire 15 standby physical servers at a cost of $200,000

MYTH 2: aiig y mgig dr

si is m sk qiig si skis

si ss.

REALITY: Not with VMware. Physical DR can be complex because of duplicate and siloed infrastructures and configuration synchronization issues across sites. Virtualization encapsulates servers, OS, and applications, including all configuration data, so the complexity is greatly reduced. Virtualization and automation ensure that recovery plans are simple, complete, and can be reliably executed by staff with no special skills required.

Disaster Recovery Myths and Realities

Disaster Recovery is like an insurance policy that you can test without having an accident.

CHAPTER 3
http://www.vmware.com/files/pdf/customers/11Q1_Swedbank_Case_Study.pdf
http://www.vmware.com/files/pdf/customers/apac_au_10Q4_ss_vmw_challenger_english.pdf



1. Virtualize. Virtual environments are much more agile and easier to migrate. Virtualization hides the complexity by shielding the individual components and moving parts, thus simplifying the planning and increasing the visibility into the DR process. It also allows you to use hypervisor-based replication that is far more flexible and cost-efficient than storage-based replication.

2. Automate. Don't let human error stand in your way. Use automated recovery plans, not a stack of notes in a binder. With the proper automation, a recovery plan can be done in a matter of minutes instead of weeks. Automation shields users from having to manage many of the recovery steps, and automatically coordinates activities such as preconfiguring networks and virtual machines, configuring the recovery infrastructure, and restarting applications.
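The idea of a software-driven recovery plan, as opposed to notes in a binder, can be illustrated with a minimal Python sketch. It is not SRM's API: the `Step` class, the `run_plan` function, and the step actions are hypothetical names used only to show ordered, automated execution that stops at the first failure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # hypothetical callable that performs the step


def run_plan(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first failure and report what completed."""
    completed: List[str] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            raise RuntimeError(
                f"step '{step.name}' failed after {completed}"
            ) from exc
        completed.append(step.name)
    return completed


# Illustrative plan: each action just records that it ran.
log: List[str] = []
plan = [
    Step("preconfigure networks", lambda: log.append("net")),
    Step("configure recovery infrastructure", lambda: log.append("infra")),
    Step("restart applications", lambda: log.append("apps")),
]
result = run_plan(plan)
```

Because the plan is data, the same list of steps can be run in a test and in a real recovery, which is what makes the procedure repeatable.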

3.viy s. Test your DR plans often. Use non-disruptive testing of your recovery and failback plans. Study a detailed report of the test outcomes, including the RTO achieved. With this information, you can gain the confidence that your disaster protection plan meets the business objectives. It will also provide the necessary training to the staff and show any possible issues early so they can be addressed.
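A test report is only useful if the achieved RTO is compared against the target. The sketch below assumes a hypothetical `DrillResult` record, with illustrative plan names and durations that are not taken from any real report.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DrillResult:
    plan: str
    achieved_rto: timedelta  # measured during a non-disruptive test
    target_rto: timedelta    # agreed with the business

    def meets_objective(self) -> bool:
        return self.achieved_rto <= self.target_rto


def failing_plans(results: List[DrillResult]) -> List[str]:
    """Return the names of plans whose test missed the RTO target."""
    return [r.plan for r in results if not r.meets_objective()]


# Illustrative values only.
results = [
    DrillResult("tier-1 applications", timedelta(minutes=25), timedelta(minutes=30)),
    DrillResult("whole site", timedelta(hours=5), timedelta(hours=4)),
]
missed = failing_plans(results)
```

Plans listed in `missed` are the ones to revisit before the next test cycle.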

4.S recovery ca

For examp

Exchange,

and started

To set your

tions and s

5.a Act early to

actual disas

condence

has been te

possible ts

6.B caused by

gone wron

data maint

migrating y

ris and gre

degradatio

Disaster Recovery Best Practices

As shared by VMware's 5,000+ SRM customers

CHAPTER 4


"In addition to our 10 developmental centers, we're also responsible for making sure providers throughout the state get the support they need to receive funding from the federal government," says Brian Brothers, network administrator manager. "If our services were to go down and we couldn't ensure reimbursement for Medicaid funds, it would have a severe impact on the providers and the developmentally disabled they serve. Some providers might shut their doors."

At DODD, SRM is responsible for a reliable, verifiable DR enablement that can be tested and audited. The agency has tested its disaster recovery solution twice. The second test involved 50 production servers, which were successfully failed over to the remote site in about 90 minutes. "If we do have a true disaster someday, our DR site becomes our production site. We expect to be up and running in less than two hours," notes Kipp Bertke, IT manager for infrastructure and operations, Ohio Department of Developmental Disabilities.

DODD's disaster recovery site doesn't just sit idle. Instead, the backup site actively supports the application development team on a daily basis.

Business results:

• A reliable disaster recovery site that can be up and running in less than two hours

• Fully tested, active disaster recovery solution implemented for an agile, private cloud infrastructure

• Online systems providing faster, more reliable service

a button. The fact that we can run tests as often as we want gives us a high degree of confidence in the recoverability of our systems," says Kenneth Newball, senior disaster recovery administrator at AHS-IS.

Business results:

• Reduced RTO by 75%, from 48 hours to less than one hour

• Eliminated the cost of flying a team of seven people to test remote DR

• Cut hardware purchases by 84.5%, maintenance by 93.1%, and power consumption by 90%

MYTH 4: dr s is sk s, ik i

s ms iky s.

REALITY: Even if the big disaster never happens, the recovery plan can be used as a migration plan with similar steps, helping you during planned downtimes such as site migrations. In addition, DR planning helps to fulfill compliance where disaster recovery plans are required. The outcome of recovery testing proves disaster preparedness and the ability to meet RTOs.

How it's done at Ohio Department of Developmental Disabilities

The Ohio Department of Developmental Disabilities (DODD) runs a statewide system of support services for some 80,000 people with developmental disabilities. A disaster causing a systemwide failure would have very real human impact.
http://www.vmware.com/files/pdf/customers/11Q1_DODD_Case_Study.pdf



While your data center is critical to your ability to conduct business, events beyond your control (or even planned ones) can make IT services unavailable or highly limited. This situation, however rare, could be very damaging to the integrity of your business, your market credibility, and your customers' satisfaction and loyalty.

You can mitigate this risk by implementing a DR solution to protect your critical IT assets. A well-designed DR solution built on an intelligent virtual infrastructure can provide the required RTO and RPO while keeping the costs in check. Your DR plans can be tested in a nondisruptive way and benefit your IT department in areas beyond the typical DR needs.

Your IT infrastructure plays the most critical role for the feasibility and the ultimate success of your DR plans. Virtualized infrastructure proved to be the most reliable and cost-effective platform for DR by allowing you to abstract the moving parts and components of your data center, simplifying the replication architecture and requiring fewer resources overall.

So how do you start the journey to protect your IT assets?

Use this quick-start list as your guide:

1. Identify your most critical applications and data. What applications directly generate revenue, maintain safety, or are otherwise critical to business continuity? What data is absolutely critical for your customers, your internal accounting and finances, or compliance?

2. If you haven't already, consider virtualizing your key applications. This will not only cut much of the operational and maintenance costs by removing unnecessary complexity, but it will also make your environment better suited for effective DR planning.

3.ag you lose? Fo

online with are realistic

4.df iiis

on the data

cally trigge

5. Iiy

is y

will be a comrecovery, an

6. S

ing specic

choices tha

the level o solution or t

Mae sure yactual disas

And nally,

a Blac Swation to reco

VMware is h

F m i

ry M

isivMw

For details aon deliverincontinuity, a

come you to

A Quick-Start Guide to Disaster R

It can be done. It must be done. VMware can help you do

CONCLUSION


9. Plan failback. Create and test a failback recovery plan, set up replication in reverse, and know when to trigger it. Agree on what to consider "the end of the disaster" so your business can go back to normal.

10. Don't just throw money at DR. Utilize cheaper, commodity failover site assets, or use the repurposed hardware left over after your primary data center has gone virtual. Consider bidirectional or shared failover sites, use more software in the cloud (SaaS), and also look at non-IT DR means (UPS or backup generators, fuel reserves, better fire protection, etc.).

7. Assign responsibilities. Assign everybody involved in the DR plan a specific task. Don't expect the relevant personnel to always be at the disaster site or to be in control immediately. Implement necessary duplication and redundancy for people, just like you would do with computers.

8.K y y s sssib. It is a good practice to prepopulate your failover site with the data that doesn't change often, or by much. This will allow you to focus only on the fast-changing critical data at the time of failover, and ultimately meet your RTO with less effort.

http://www.vmware.com/products/site-recovery-manager/overview.html



Disaster recovery is a key part of a company's business continuity initiative to ensure the availability of integral IT-dependent business processes and prevent any long-term negative effects of both planned and unplanned disruptions. The goal of DR is to restore critical IT services as quickly as possible and minimize business disruption.

Nothing impacts your ability to recover more than the agility of your IT and applications infrastructure. Just like fire safety must be built into the building before the fire occurs, and a car's safety features are engineered to reduce crash impact, the design of your IT infrastructure can make or break the success of your DR program.

IT AND APPLICATIONS INFRASTRUCTURE

Your data center's infrastructure plays an instrumental role in the effectiveness of your DR solution. The infrastructure can make DR very complex, hard to implement, and sometimes even impossible; or it can help to make your IT reliable, verifiable, and effective. The next section explains how.

Two key processes for simple and reliable

disaster recovery:

FAILOVER

Failover is the capability to switch over to a redundant or standby server, system, or network upon the failure or termination of an existing asset. Failover should happen without any kind of human intervention or warning.

FAILBACK

Failback is the process of restoring a system or another asset that is in a failover state back to its original state. Effective failback returns the system to the state of operation before the disruption.
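The relationship between failover and failback can be pictured as a two-state machine. The following Python sketch is only an illustration of that idea: `SiteState` and `ProtectedService` are hypothetical names, and a real solution would also have to handle replication and reverse replication.

```python
from enum import Enum


class SiteState(Enum):
    PRIMARY_ACTIVE = "running at primary site"
    FAILED_OVER = "running at recovery site"


class ProtectedService:
    """Tracks where a protected service currently runs (illustrative only)."""

    def __init__(self) -> None:
        self.state = SiteState.PRIMARY_ACTIVE

    def failover(self) -> None:
        # Switch to the standby site when the primary asset fails.
        if self.state is not SiteState.PRIMARY_ACTIVE:
            raise RuntimeError("service is already failed over")
        self.state = SiteState.FAILED_OVER

    def failback(self) -> None:
        # Return to the original site; only valid from the failover state.
        if self.state is not SiteState.FAILED_OVER:
            raise RuntimeError("service is not in a failover state")
        self.state = SiteState.PRIMARY_ACTIVE


service = ProtectedService()
service.failover()
state_after_failover = service.state
service.failback()
state_after_failback = service.state
```

Rejecting a failback that was not preceded by a failover mirrors the definition above: failback only applies to an asset that is in a failover state.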

Disaster Recovery 101: The Basics

[Figure labels: Primary Site, Recovery Site]

APPENDIX

RTO

While RTO is purely a technical metric, the decision to trigger the failover is a business one, and RTO can often take much longer than the actual DR itself. Whether initiated by humans or by an automatic trigger, the lead time to start DR should be also accounted for and included in RTO. Replication is a key element of any DR process in most cases, usually provided by the specific DR solution used.
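As a small worked example of including the decision lead time in RTO, the sketch below simply adds the two durations. The function name `effective_rto` and the numbers are assumptions chosen for illustration.

```python
from datetime import timedelta


def effective_rto(decision_lead_time: timedelta,
                  recovery_duration: timedelta) -> timedelta:
    """Time from the disruption until services are restored,
    counting the time spent deciding to trigger the failover."""
    return decision_lead_time + recovery_duration


# Illustrative values: 45 minutes to decide, 30 minutes to execute recovery.
total = effective_rto(timedelta(minutes=45), timedelta(minutes=30))
```

With these example values the effective RTO is 75 minutes, more than double the technical recovery time alone.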

REPLICATION

In the context of preparing for a failover, replication provides intentionally architected redundancy of your IT resources: hardware, data, software, networks, or all of them together. There are several factors in determining the depth and amount of replication needed: type of services to be protected, criticality of different components, technology, and cost.

DISASTER RECOVERY SCENARIOS

Various DR scenarios and techniques are available to meet your specific requirements and cost objectives. The right architecture can make your DR procedures more efficient, cost-effective, and predictable. Here are a few commonly used configurations from which to choose:

• Active-Passive: This is a more traditional DR scenario, where a production site running applications is recovered at a second site that is idle until failover is required. In this scenario you are paying for a DR site that is idle most of the time.

Key metrics for planning and measuring success of

the procedures.

RPO

Recovery Point Objective (RPO) is the point in time to which you must recover data as defined by your organization, generally called an "acceptable loss" in a disaster situation. It allows an organization to define a window of time before a disaster when data may be lost and is tightly dependent on the type of data replication used. The higher the granularity of data replication, the shorter the RPO.
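The dependence of RPO on replication can be sketched with simple arithmetic. The example assumes periodic replication and ignores transfer time, so the worst-case loss window is taken to equal the replication interval; the function names and values are illustrative.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replica_at: datetime,
                     disaster_at: datetime) -> timedelta:
    """Data written after the last completed replica is lost."""
    return disaster_at - last_replica_at


def interval_meets_rpo(replication_interval: timedelta,
                       rpo: timedelta) -> bool:
    """Assumption: worst-case loss equals one full replication interval."""
    return replication_interval <= rpo


# Illustrative values: replicas every 15 minutes against a 90-minute RPO.
loss = data_loss_window(datetime(2011, 3, 1, 12, 0),
                        datetime(2011, 3, 1, 12, 10))
frequent_ok = interval_meets_rpo(timedelta(minutes=15), timedelta(minutes=90))
nightly_ok = interval_meets_rpo(timedelta(hours=24), timedelta(minutes=90))
```

In this sketch a nightly copy cannot satisfy a 90-minute RPO, while 15-minute replication can.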

qai-a

worloa

Congu

the virtu

so that y

the wor

qBii

tion so t

at both s

spare ca

virtual e

ql F

ail over

when a s

orces y

qS

deploym

single re

multiple

All prote

this sing

recovery

that nee

This top

recovery


[Figure: two sites, each with Site Recovery Manager and VMware vCenter Server running on VMware vSphere servers.]
