eBook a Guide to Modern IT Disaster Recovery

Embed Size (px)

Citation preview

  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    1/7

    IntroductIon: W Bk Sws,

    w y wi y ?

    chapter 1: e u

    chapter 2: S wi Intelligent

    VirtualFi

    chapt

    t 10

    concl

    r

    appenIN

    SIDE:

    www.vmware.com

    How to prepare or and mitigate the

    Back Swan Events in your data center

    A Guide to ModernIT Disaster Recove

  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    2/7

    INtroDuctIoN

    Your data center IS Your caStle.Thats where

    all your critical IT components hardware, data, and

    sotware reside. You protect it with the latest bul-

    letproo security solutions and mae it reliable through

    redundant multiprocessing, highly scalable platorms,

    and super-ast optical networs.

    And yet, it is not ully protected rom orces beyond

    your control such as natural disasters; man-made events

    lie road closures; security procedures or partner service

    interruption at a specic site.

    Downtime and loss o data, even i temporary, can have

    long-lasting eects or business and can contribute to

    the demise o the otherwise well-lubed business:

    q Loss o revenue rom your customers inability

    to do business with you

    q Diminished maret credibility and customer

    trust, resulting in churn

    q Penalties or violated SLAs with partners,

    suppliers, distributors, and ranchisees

    q Costs o recovering and repairing the lost data

    q Legal costs o meeting internal and external

    compliance requirements

    How do yo

    vestment e

    investment

    q 43% o

    reope

    q 93% o

    days

    q 40%

    disas

    acces

    CIOs and I

    which norm

    adapt prac

    deal with p

    actions as w

    Gartners Top

    These sta

    within your

    1 McGladrey and2 National Arch3 Gartner, Dece

    Expect the UnexpectedOur hope is you never need to activate an IT disaster reco

    Our job is to provide automated protection if you do nee

    chaptEr 1

    You now VMware as the virtualization

    company that has been the maret leader or

    the past 11 years. In act, according to Gartner,

    more than 80% o all virtualized applications

    in the world run on VMware today. This eboo

    highlights the VMware perspective on disas-

    ter recovery in the data center. But lets orget

    about IT or a moment.

    The ty Bk Sw es is a

    metaphor that encapsulates the concept o

    surprise events that have a major impact.Itreers to unexpected events o large magnitudeand consequence and their dominant role

    in history. Such events, considered extreme

    outliers, play vastly larger roles than regular

    occurrences.

    The Black Swan, a boo by Nassim Nicholas

    Taleb, explains that while Blac Swan Events

    are unpredictable, a person or organization

    can plan or negative events, and by doing

    so strengthen their ability to respond, as well

    as exploit positive events. Taleb contends that

    people in general and specically enterprises

    are very vulnerable to hazardous Blac

    Swan Events and are exposed to high losses

    i unprepared.

    There is an obvious parallel between the

    Theory o Blac Swan Events and the need or

    disaster preparedness or your critical IT assets.

    Deploying automated disaster recovery (DR)

    is the way to protect IT and business rom

    unpredictable events even Blac Swan

    Events. The ollowing chapters explain the

    basics o DR and the required inrastructure.

    They also oer DR hidden realities and best

    practices with real-world advice.

    What are Back Swans, and what dothey have to do with your data center?

    1

    DR is the IT industrys way to prepare and f

  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    3/7

    3

    untIl relIaBle vIrtualIzatIon ManaGeMent

    SolutIonSbecame available several years ago,

    DR solutions ell well short o satisying business

    requirements due to the ollowing:

    q High Cost

    q Complexity

    q Lac o Reliability

    With traditional manual DR solutions, thehigh cost

    came rom the need to deploy a second ailover site

    with dedicated inrastructure, sotware licenses, and

    human personnel. The complexitywas high because,

    to ensure the recovery o entire business services, the

    recovery plans had to manipulate many individual com-

    ponents and moving parts: applications, hosts, networ,

    and storage. Thelack of reliabilityo these procedures

    was diminished by low automation and ina bility to test

    any recovery procedure.

    Many organizations had limited condence in meeting

    their Recovery Point Objective (RPO) and Recovery

    Time Objective (RTO) in the event o a disaster. IT de-

    partments were hesitant to expand disaster protection,

    uncertain whether the quality o the insurance was really

    worth its cost.

    Virtualization is undamental and critical to the success

    o DR planning. Virtualization abstracts the complexity

    o hardware and sotware and allows standardization o

    processes, thus maing planning and automation o the

    recovery procedures much more reliable and repeatable.

    In act, in a recent IDG survey, 70% o customers

    achieved improved BC/DR with virtualization.1

    Anintelligent virtual inrastructure based on VMware is

    the right oundation or the modern DR solution. Highly

    adaptive and scalable, it is optimized or business-

    critical worloads with built-in intelligence.

    The VMware DR soution provides:

    q t sims wy i

    iis sy si

    qt sims wy s y

    migi s

    qFy m, ms ib si

    y migi

    Start with an InTEllIGEnT andVIRTUAl FoundationReiabeg Repeatabeg Recoverabe

    chaptEr 2

    Virtua Recovery Process 4 Hours

    Restore VM Power on VM

    Physica Recovery Process 40 Hours

    CongureHardware

    Install OS InstallBacup

    StartSingle-StepAutomaticRecovery

    Congure OS

    cs-i dr: With the rapid adoption o virtualiza-

    tion and the evolution o replication technology, DR is

    becoming more cost-ecient. Virtualization enables in-

    rastructure consolidation at the ailover site. Less costly

    replication options are more broadly available, using

    lower-end storage appliances or stand-alone sotware

    solutions. With these advances, DR can protect large-

    scale mission-critical IT assets, as well as smaller sites

    and Tier 2 applications.

    am dr: In virtual environments, end users are

    shielded rom the complexity o managing each step

    in the recovery process. Now, a DR solution can auto-

    matically execute and coordinate all the steps required

    to ensure the desired level o protection. Traditional run-

    boos are no longer good enough to manage recovery

    plans and are replaced with sotware-driven recovery

    plans. Setti

    ment is as s

    business se

    rib si

    tion, organ

    that they c

    provides th

    a non-disruare now rep

    ing the ris

    predictable

    The chart

    virtualize

    along wit

    chaptEr 2 continued

    How would you describe your organizations usage o the ocapabilities with itsproduction environment-based virtual m

    0 20

    39%

    35%

    Automated restart o virtual machines in the evento a physical server hardware ailure

    Bacup and recovery solutions integratedwith the virtualization platorm

    Site recovery solutions or virtual machines

    Live migration o virtual machines based on CPU, memory,and networ utilization policies

    Live migration o virtual machines

    Live migration o storage associatedwith virtual machines

    Automated deployment o virtualized servers basedon CPU, memory, and networ utilization policies

    Automated enorcement o liecycle policies andreclamation o resources rom expired virtual machines

    Automated deployment o virtual machines based onpower consumption policies

    We currently use this eature/capability

    W e h av e n o p la ns to us e t hi s ea tu re /c ap ab il it y D on t n ow

    We plan to

    1 IDG Research, Benets o Virtualizing Business Critical Applications,March 2011 Source: ESG white paper: Enterprise Strategy Group, 2011: Virtualizatio

    http://www.vmware.com/solutions/cloud-solutions.htmlhttp://www.vmware.com/products/site-recovery-manager/overview.htmlhttp://www.vmware.com/products/site-recovery-manager/overview.htmlhttp://www.vmware.com/solutions/cloud-solutions.html
  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    4/7

    MYTH 3: a

    y w

    REAlITY

    out testing

    be tested w

    validity. SR

    recovery p

    How its do

    Adventist H

    zation in th

    roughly ou

    Services (A

    employs m

    To ensure A

    Zero initia

    service and

    systems li

    record app

    Adding SR

    IS to stream

    DR plannin

    ing and tes

    chaptEr 3 continuedWith SRM, setting up an automated recovery plan is

    eortless and can be done in a matter o minutes, in-

    stead o the wees required to set up manual runboos.

    How its done at Swbk

    Swedban is one o the largest nancial institutions

    in Scandinavia and the Baltic, with 362 branches in

    Sweden and 222 branches in Estonia, Latvia, and Lithu-ania. The ban serves 9.5 million private customers and

    534,000 corporate customers, with 18,000 employees.

    Preventing service disruption is critical to Swedban.

    Swedban had to meet recovery targets or its legacy

    applications through traditional means o bacup and

    recovery, which were complex and time-consuming.

    Swedban deployed SRM to simpliy and automate the

    recovery process, management, and testing o recovery

    plans. Since the SRM implementation, Swedban tests

    its DR capabilities at least twice a year. It shuts down

    one data center completely, moving worloads to the

    surviving data center. It runs everything in the bacup

    data center or 24 hours, and then ails over again to the

    original data center.

    Mart Nael, head o Core Inrastructure, Group IT at

    Swedban, states, Our recovery time is below 30

    minutes or mission-critical worloads and less than our

    hours or the whole data center.

    Bsiss ss:

    q Positive ROI within one year rom hardware-

    cost avoidance

    q IT operational costs reduced 14% year-to-year

    q 1,000 virtual machines managed by two ull-

    time-equivalent staers

    q 30x aster server provisioning

    5

    MYTH 1: diss ry is y ; is

    si s smig.

    REAlITY: VMware vCenter Site Recovery Manager

    (SRM) gives you the fexibility to dene ailover scenar-

    ios that meet your choice o coverage, speed, a nd cost

    o recovery. For example, while a dedicated recovery

    site is a robust solution (and yes, more expensive), in

    many cases it is sucient to have a n active bidirec-

    tional approach where two or more data centers are

    complementary with enough capacity to pic up critical

    applications. Thereore, no resources are wasted and

    business continuity is maintained.

    Overall, SRM customers consistently report signicantsavings o money, resources, and time.

    How its done at cg limi

    Challenger Limited issues annuities and provides

    investment products and services. The organization

    runs two co-located data centers, supporting around

    500 sta in Australia.

    In order to meet business requirements or ast recovery

    and minimal data loss, Challenger Limited implemented

    a dual-cluster VMware inrastructure that was lined

    to networed storage devices in its two co-located

    data centers, at about one-third the cost o a physi-

    cal disaster recovery environment. SRM has enabled

    the organization to dispense with most o the 50 tapes

    previously used to bac up data, saving one person-day

    per wee. In addition, Challenger Limited automated

    hundreds o steps in its disaster recovery processes.

    Bsiss ss:

    q Improved RPO rom 24 hours to 90 minutes

    and recovery time rom 24 hours to less than

    our hours

    q Reduced the number o people needed to restore

    systems to one person

    q Cut capital investment or disaster recovery to a

    third o the cost o a physical environment

    q Eliminated the need to acquire 15 standby

    physical servers at a cost o $200,000

    MYTH 2: aiig y mgig dr

    si is m sk qiig si skis

    si ss.

    REAlITY: Not with VMware. Physical DR can be

    complex because o duplicate and siloed inrastructures

    and conguration synchronization issues across sites.

    Virtualization encapsulates servers, OS, and applica-

    tions, including all conguration data, so the complexity

    is greatly reduced. Virtualization and automation ensure

    that recovery plans are simple, complete, and can be

    reliably executed by sta with no special sills required.

    Disaster Recovery Myths and ReaitiesDisaster Recovery is ike an insurance poicy that you can

    test without having an accident.

    chaptEr 3 VMwMnnd e

    s es

    http://www.vmware.com/files/pdf/customers/11Q1_Swedbank_Case_Study.pdfhttp://www.vmware.com/files/pdf/customers/apac_au_10Q4_ss_vmw_challenger_english.pdfhttp://www.vmware.com/files/pdf/customers/apac_au_10Q4_ss_vmw_challenger_english.pdfhttp://www.vmware.com/files/pdf/customers/11Q1_Swedbank_Case_Study.pdf
  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    5/7

    1.vii.Virtual environments are much more agileand easier to migrate. Virtualization hides the complex-

    ity by shielding the individual components and moving

    parts, thus simpliying the planning and increasing the

    visibility into the DR process. It also allows you to use

    hypervisor-based replication that is ar more fexible and

    cost-ecient than storage-based replication.

    2.am.Dont let human error stand in your way.Use automated recovery plans, not a stac o notes in

    a binder. With the proper automation, a recovery plan

    can be done in a matter o minutes instead o wees.

    Automation shields users rom having to manage many

    o the recovery steps, and automatically coordinates

    activities such as preconguring networs and virtual

    machines, conguring the recovery inrastructure, and

    restarting applications.

    3.viy s.Test your DR plans oten. Use non-disruptive testing o your recovery and ailbac plans.

    Study a detailed report o the test outcomes, including

    the RTO achieved. With this inormation, you can gain

    the condence that your disaster protection plan meets

    the business objectives. It will also provide the neces-

    sary training to the sta and show any possible issues

    early so they can be addressed.

    4.S recovery ca

    For examp

    Exchange,

    and started

    To set your

    tions and s

    5.a Act early to

    actual disas

    condence

    has been te

    possible ts

    6.B caused by

    gone wron

    data maint

    migrating y

    ris and gre

    degradatio

    Disaster Recovery Best Practices As shared by VMwares 5,000+ SRM customers

    chaptEr 4

    7

    In addition to our 10 developmental centers, were also

    responsible or maing sure providers throughout the

    state get the support they need to receive unding rom

    the ederal government, says Brian Brothers, networ

    administrator manager. I our services were to go down

    and we couldnt ensure reimbursement or Medicaid

    unds, it would have a severe impact on the providers

    and the developmentally disabled they serve. Some

    providers might shut their doors.

    At DODD, SRM is responsible or a reliable, veriable DR

    enablement that can be tested and audited. The agency

    has tested its disaster recovery solution twice. The

    second test involved 50 production servers, which were

    successully ailed over to the remote site in about 90

    minutes. I we do have a true disaster someday, our

    DR site becomes our production site. We expect to

    be up and running in less than two hours, notes kipp

    Berte, IT manager or inrastructure and operations,

    Ohio Department o Developmental Disabilities.

    DODDs disaster recovery site doesnt just sit idle.

    Instead, the bacup site actively supports the

    application development team on a daily basis.

    Bsiss ss:

    q A reliable disaster recovery site that can

    be up and running in less than two hours

    q Fully tested, active disaster recovery

    solution implemented or an agile, private

    cloud inrastructure

    q Online systems providing aster, more

    reliable service

    chaptEr 3 continueda button. The act that we can run tests as oten as we

    want gives us a high degree o condence in the recov-

    erability o our systems, says kenneth Newball, senior

    disaster recovery administrator at AHS-IS.

    Bsiss ss:

    q Reduced RTO by 75%, rom 48 hours to less

    than one hour

    q Eliminated the cost o fying a team o sevenpeople to test remote DR

    q Cut hardware purchases by 84.5%, maintenance

    by 93.1%, and power consumption by 90%

    MYTH 4: dr s is sk s, ik i

    s ms iky s.

    REAlITY: Even i the big disaster never happens,

    the recovery plan can be used as a migration plan

    with similar steps, helping you during planned down-

    times such as site migrations. In addition, DR planning

    helps to ulll compliance where disaster recovery plans

    are required. The outcome o recovery testing proves

    disaster preparedness and the ability to meet RTOs.

    How its done at oi dm

    dm disbiiis

    The Ohio Department o Developmental Disabilities

    (DODD) runs a statewide system o support services or

    some 80,000 people with developmental disabilities. A

    disaster causing a systemwide ailure would have very

    real human impact.

    http://www.vmware.com/files/pdf/customers/11Q1_DODD_Case_Study.pdfhttp://www.vmware.com/files/pdf/customers/11Q1_DODD_Case_Study.pdfhttp://www.vmware.com/files/pdf/customers/11Q1_DODD_Case_Study.pdfhttp://www.vmware.com/files/pdf/customers/11Q1_DODD_Case_Study.pdf
  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    6/7

    While your data center is critical to your ability to conductbusiness, events beyond your control (or even planned

    ones) can mae IT services unavailable or highly limited.This situation, however rare, could be very damaging to theintegrity o your business, your maret credibility, and your

    customers satisaction and loyalty.

    You can mitigate this ris by implementing a DR solution to

    protect your critical IT assets. A well-designed DR solutionbuilt on an intelligent virtual inrastructure can provide the

    required RTO and RPO while eeping the costs in chec. YourDR plans can be tested in a nondisruptive way and benetyour IT department in areas beyond the typical DR needs.

    Your IT inrastructure plays the most critical role or the

    easibility and the ultimate success o your DR plans.Virtualized inrastructure proved to be the most reliable

    and cost-eective platorm or DR by allowing you toabstract the moving parts and components o your datacenter, simpliying the replication architecture and requir-ing ewer resources overall.

    So how do you start the journey to protect your IT assets?

    Use this quic-start list as your guide:

    1.Iiy y ms ii iis .What applications directly generate revenue, maintainsaety, or are otherwise critical to business continuity?

    What data is absolutely critical or your customers, yourinternal accounting and nances, or compliance?

    2. I y y, si iiig y ky

    iis. This will not only cut much o the operationaland maintenance costs by removing unnecessary complex-

    ity and operational cost, but it will also mae your environ-ment better suited or eective DR planning.

    3.ag you lose? Fo

    online with are realistic

    4.df iiis

    on the data

    cally trigge

    5. Iiy

    is y

    will be a comrecovery, an

    6. S

    ing specic

    choices tha

    the level o solution or t

    Mae sure yactual disas

    And nally,

    a Blac Swation to reco

    VMware is h

    F m i

    ry M

    isivMw

    For details aon deliverincontinuity, a

    come you to

    A Quick-Start Guide to Disaster RIt can be done. It must be done. VMware can help you do

    coNcluSIoN

    9

    9.p ibk.Create and test a ailbacrecovery plan, set up replication in reverse, and now

    when to trigger it. Agree on what to consider the

    end o the disaster so your business can go bac

    to normal.

    10.d js w my dr.Utilize cheaper,commodity ailover site assets, or use the repurposed

    hardware let over ater your primary data center has

    gone virtual. Consider bidirectional or shared ailover

    sites, use more sotware in the cloud (SaaS), and also

    loo at non-IT DR means (UPS or bacup generators,

    uel reserves, better re protection, etc.).

    chaptEr 4 continued7.assig ssibiiis.Assign everybody involvedin the DR plan a specic tas. Dont expect the relevant

    personnel to always be at the disaster site or to be in

    control immediately. Implement necessary duplication

    and redundancy or people, just lie you would do with

    computers.

    8.K y y s sssib.It is a good practice to prepopulate yourailover site with the data that doesnt change oten, or

    by much. This will allow you to ocus only on the a st-

    changing critical data at the time o ailover, and ulti-

    mately meet your RTO with less eort.

    .

    http://www.vmware.com/products/site-recovery-manager/overview.htmlhttp://www.vmware.com/products/site-recovery-manager/overview.html
  • 8/3/2019 eBook a Guide to Modern IT Disaster Recovery

    7/7

    11

    dISaSter recoverY IS a KeY part o a companys

    business continuity initiative to ensure the availability o

    integral IT-dependent business processes and prevent

    any long-term negative eects o both planned and u

    nplanned disruptions. The goal o DR is to restore critical

    IT services as quicly as possible and minimize business

    disruption.

    Nothing impacts your ability to recover more than the

    agility o your IT and applications inrastructure. Just

    lie re saety must be built into the b uilding beore the

    re occurs, and a car s saety eatures are engineered to

    reduce crash impact, the design o your IT inrastructure

    can mae or brea the success o your DR program.

    It and applIcatIonS InFraStructure

    Your data centers inrastructure plays an instrumentalrole in the eectiveness o your DR solution. The inra-

    structure can mae DR very complex, hard to imple-

    ment, and sometimes even impossible; or it ca n help

    to mae your IT reliable, veriable, and eective. The

    next section explains how.

    Two key processes for simple and reliable

    disaster recovery:

    FaIlover

    Failover is the capability to switch over to a redundant

    or standby server, system, or networ upon the ailure

    or termination o an existing asset. Failover should ha p-

    pen without any ind o human intervention or warning.

    FaIlBacK

    Failbac is the process o restoring a system or another

    asset that is in a ailover state bac to its original state.

    Eective ailbac returns the system to the state o

    operation beore the disruption.

    diss ry 101 t Bsis

    pimy Si ry Si

    appENDIx

    rto

    While RTO is purely a technical metric, the decision to

    trigger the ailover is a business one, and RTO can oten

    tae much longer than the ac tual DR itsel. Whether

    initiated by humans or by an automatic trigger, the

    lead time to start DR should be also accounted or and

    included in RTO. Replication is a ey element o any DR

    process in most cases, usually provided by the specic

    DR solution used.

    replIcatIon

    In the context o preparing or a ailover, replication

    provides intentionally architected redundancy o your IT

    resources: hardware, data, sotware, networs, or all o

    them together. There are several actors in determining

    the depth and amount o replication needed: type o

    services to be protected, criticality o dierent compo-

    nents, technology, and cost.

    dISaSter recoverY ScenarIoS

    Various DR scenarios and techniques are available to

    meet your specic requirements and cost objectives. The

    right architecture can mae your DR procedures more

    ecient, cost-eective, and predictable. Here are a ew

    commonly used congurations rom which to choose:

    qai-pssi: This is a more traditional DR sce-

    nario, where a production site running applications is

    recovered at a second site that is idle until ailover is

    required. In this scenario you are paying or a DR site

    that is idle most o the time.

    Key metrics for planning and measuring success of

    the procedures.

    rpo

    Recovery Point Objective (RPO) is the point in time to

    which you must recover data as dened by your organi-

    zation, generally called an acceptable loss in a disaster

    situation. It allows an organization to dene a window

    o time beore a disaster when data may be lost and is

    tightly dependent on the type o data replication used.

    The higher granularity o data replication, the shorter

    the RPO.

    qai-a

    worloa

    Congu

    the virtu

    so that y

    the wor

    qBii

    tion so t

    at both s

    spare ca

    virtual e

    ql F

    ail over

    when a s

    orces y

    qS

    deploym

    single re

    multiple

    All prote

    this sing

    recovery

    that nee

    This top

    recovery

    appENDIx continued

    VMware vSphere

    VMwarevCenter Server

    Site RecoveryManager

    VMwarevCenter Server

    Site RecoveryManager

    VMware vSphere

    Servers Servers