Achieving High Availability Objectives

  • 8/9/2019 Achieving High Availability Objectives

    In this white paper we will discuss the need for high availability and how it is defined and measured. Then we'll outline the shortcomings in common high availability designs and describe a methodology to address those shortcomings.

    WHITE PAPER

    Introduction

    In today's IT environments, the need for the highest levels of availability is a well-established principle. Businesses increasingly require immediate and continuous access to their information systems, and regularly set up traditional high availability software clusters to meet this business objective. This common solution, however, is often not enough to meet the business's high availability objectives.

    The purpose of this report is twofold. First, we will discuss the need for high availability and how availability is defined and measured. Second, we will discuss how to weed out the shortcomings in common high availability designs and describe a methodology to address those shortcomings.

    This report is written from a technology-independent perspective. It is not biased toward any high availability software product or any server, storage or network hardware vendor. Also, our high availability discussion will focus primarily on open systems technology such as UNIX servers and Windows 2000/Windows NT servers.

    Defining and measuring high availability

    What is high availability?

    To begin with, it is important that we are all speaking the same language when we speak of high availability (HA). For the purposes of this report, we will use the following definition:

    An application environment is highly available if it possesses the ability to recover automatically within a prescribed minimal outage window. [1]

    HA implies that no single point of failure (SPOF) exists in the application environment. A SPOF is any software, hardware or environmental component that, if it should fail, would take the application environment offline for an extended outage and require human intervention to correct.
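    As a rough illustration of hunting for SPOFs, the sketch below flags every component that has no redundant peer. The inventory, its component names and its instance counts are hypothetical, and redundancy alone is only a first approximation: a redundant component still needs automatic failover to satisfy the definition above.

```python
# Hypothetical inventory: component name -> number of redundant instances.
# Names and counts are illustrative assumptions, not taken from the paper.
inventory = {
    "LAN switch": 2,
    "presentation server": 2,
    "application server": 1,
    "database server": 2,
    "storage array": 1,
}


def single_points_of_failure(components):
    """Return, in alphabetical order, the components with fewer than two instances."""
    return sorted(name for name, count in components.items() if count < 2)


spofs = single_points_of_failure(inventory)
print("Single points of failure:", spofs)
```

    In this made-up inventory the application server and the storage array would be the first candidates for redundancy.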

    It is also important to note what HA is not. HA is not continuous access to the application environment throughout failures. This area of availability, called continuous availability, is addressed by such technologies as fault tolerant hardware, data center site redundancy, and real-time remote data replication. Application environments requiring continuous availability cannot tolerate any kind of outage.

    What is downtime?

    The goal of high availability solutions is to automatically recover from functional downtime within a minimal outage window. By downtime we mean a service interruption at any layer of the application environment.

    It is important to understand clearly what we mean by the application environment. The application environment refers to all the hardware and software required to support a function provided to the business from IT. Largely we think of this environment as the servers, software, network and storage required for users to be able to execute a program. For example, the application environment of a web-based inventory application would consist of the following:

    Employee and their workspace

    Desktop server

    All LAN connections

    Presentation server (HTTP server)

    Application server

    Database server

    Operating System (at each level)

    Application software (at each level)

    Data storage subsystems for each server

    Downtime is a service interruption at any or all of these layers.

    Figure 1: causes of downtime (bar chart, 0% to 30%, by cause: Software, Hardware, Human error, Network, Local environment, Other). Source: Data Quest, November 1999

    [1] The minimal outage window depends on the critical nature of the business being executed. Generally these windows are from 3 to 7 minutes.

    As mentioned above, downtime must be considered from a functional, or user, perspective. Another way of putting it is to say that anything that keeps a user from being able to conduct IT-supported business is functional downtime. It is important to note what a user means here. A user is any person or system that executes business functions with the designated application environment. Users could be employees, customers, business partners, or other IT application environments.

    What causes downtime?

    Traditional high availability software packages typically focus their attention solely on hardware-related failures. Yet a 1999 Data Quest study reported that only 23 percent of service interruptions were caused by hardware-related failures. (See Figure 1.)

    The study showed that fully 27 percent of service interruptions were software related. From this statistic alone, we can gather that any solution that addresses only hardware-related failures would be incomplete.

    Not surprisingly, software monitoring utilities have recently become more commonplace, suggesting that software-related failures are being acknowledged and addressed more frequently.

    Other sources of downtime included human error (18 percent), network issues (17 percent), local environment issues (8 percent), and other issues (7 percent). True high availability can only be achieved when consideration is given to all areas that may cause downtime.
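    To make the point concrete, the short sketch below adds up the share of service interruptions that a solution leaves unaddressed when it covers only some causes. The percentages are the ones quoted above; the function name and the idea of a "covered" set are illustrative assumptions.

```python
# Share of service interruptions by cause, in percent, as quoted in the text
# from the 1999 Data Quest study.
causes = {
    "software": 27,
    "hardware": 23,
    "human error": 18,
    "network": 17,
    "local environment": 8,
    "other": 7,
}


def uncovered_share(cause_shares, covered):
    """Percent of interruptions whose cause is not in the `covered` set."""
    return sum(pct for cause, pct in cause_shares.items() if cause not in covered)


total = sum(causes.values())
hardware_only = uncovered_share(causes, {"hardware"})
hardware_and_software = uncovered_share(causes, {"hardware", "software"})

print(f"All causes sum to {total}%")
print(f"Unaddressed by a hardware-only solution: {hardware_only}%")
print(f"Unaddressed by hardware plus software monitoring: {hardware_and_software}%")
```

    Even with both hardware and software covered, half of the reported interruptions fall under causes that technology alone does not address.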

    What is the cost of downtime?

    Now that we understand what is meant by downtime, the first step in planning for high availability is to understand the exposure posed to a company by the interruption of the application environment. It is often prudent to quantify the cost of downtime per hour of that environment. Quantifying the cost of downtime is helpful as it clearly and concisely details the risk a company faces. Deploying high availability solutions is one way of mitigating that risk. Simply put, if you know how much you can lose, you know how much to spend in prevention.

    The cost of downtime varies tremendously by industry. A study recently published by the Meta Group puts the cost of downtime for many common industries anywhere between $0.5 million and $3 million per hour (see Figure 2). These figures are based on an entire IT operations center being off-line. However, outages of a single server can cost several thousand dollars per hour. These figures are meant only as an example to show that the revenue lost in an outage is substantial and should be investigated on an application environment basis for each company.
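    A minimal sketch of the "know how much you can lose" reasoning follows. All input values are hypothetical and stand in for figures a company would determine for its own application environment; only direct costs are counted, so the result is a floor on the real exposure rather than a full estimate.

```python
def annual_downtime_cost(cost_per_hour, downtime_hours_per_year):
    """Direct exposure for one application environment, in dollars per year."""
    return cost_per_hour * downtime_hours_per_year


def justified_annual_spend(cost_per_hour, hours_before, hours_after):
    """Upper bound on what a solution is worth per year, counting only the
    direct cost of the downtime it is expected to remove."""
    return annual_downtime_cost(cost_per_hour, hours_before - hours_after)


# Hypothetical inputs for a single server (assumptions, not from the paper).
COST_PER_HOUR = 5_000   # dollars lost per hour of outage
HOURS_BEFORE = 20       # expected downtime per year today
HOURS_AFTER = 2         # expected downtime per year with the solution in place

exposure = annual_downtime_cost(COST_PER_HOUR, HOURS_BEFORE)
ceiling = justified_annual_spend(COST_PER_HOUR, HOURS_BEFORE, HOURS_AFTER)

print(f"Current exposure: ${exposure:,} per year")
print(f"Spend ceiling:    ${ceiling:,} per year")
```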

    Any accurate cost of downtime study must also consider the indirect costs associated with service interruptions. It is difficult to translate these costs into loss per hour of downtime, but to deny that such intangibles contribute to cost is shortsighted. Examples of these intangible costs include decreased customer satisfaction, penalties for failure to meet service level agreements, and legal liability associated with failure to provide service. This is especially relevant in the healthcare and financial industries.

    Figure 2: cost of downtime
    Source: Meta Group, Individual.com, October 2000

    Industry               $ per hour
    Energy                 $3M
    Telecommunications     $3M
    Finance                $1.5M
    IT-dependent mfg.      $1.5M
    Healthcare             $0.5M
    Media                  $0.5M
    Hospitality/Travel     $0.5M

    A formal Business Impact Analysis (BIA) may be an appropriate means to calculate the cost of downtime. These types of studies can be conducted to outline all risks associated with service disruption, but they also go further than strictly investigating IT infrastructure. A BIA is a more robust study that clearly identifies the exposures faced by a company.

    The cost of availability

    Once we have quantified the cost of downtime we can then evaluate availability technology on the market. Different recovery and availability objectives dictate different costs; these range from large outage window, low-cost solutions such as offsite backup tape storage, to small outage window, high-end solutions such as remote disk mirroring and multiple data centers. Figure 3 shows that the initial cost of narrowing our recovery windows grows substantially as we approach fault tolerant solutions.

    How do you calculate expected uptime?

    In today's market it is very much in vogue to brag about the number of nines of availability your hardware solution demonstrates. For instance, a data storage subsystem may be built with so much redundancy that it is capable of 99.999 percent uptime. In a 24x7x365 shop, 99.999 percent translates to about 5.25 minutes of downtime per year. However, your entire application environment likely does not run solely on your data storage subsystem. You likely have LAN connections and multiple server layers, as well as storage subsystems. All of these have an impact on your uptime.
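    The conversion from "nines" to downtime is simple arithmetic, sketched below for a few common availability levels. The function name is an illustrative choice, and the year is assumed to be 365 days of round-the-clock operation; with that assumption 99.999 percent works out to roughly 5.26 minutes, which the text rounds to about 5.25.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Expected downtime per year, in minutes, for a given uptime percentage."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime -> {minutes:.2f} minutes of downtime per year")
```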

    For the sake of argument, let's extend the above application environment into a more realistic scenario. Figure 4 describes a typical application environment, with each element's anticipated uptime and downtime. The total is a rough estimate of expected downtime within the environment per year. The most important point to note is that these availability numbers assume that all other criteria for successful high availability solutions have been met. The next section describes these criteria and how they can dramatically affect functional availability of an application environment.

    Figure 3: cost of availability (chart of cost against recovery time). Solutions plotted: Redundant sites; Hot site disk mirroring; Hot site remote tape vaulting; Local high availability clusters; Local disk copies; Local tape vaulting; Daily tape copy off site; Weekly tape copy off site.

    Figure 4: expected downtime in a typical application environment

    Layer                                Uptime     Downtime per year
    All LAN connections                  99.9%      8.76 hours
    Presentation server                  99.2%      70.08 hours
    Application server                   99.7%      26.28 hours
    Database server                      99.995%    26.25 minutes
    Database application software        99.3%      61.32 hours
    Data subsystem for database only     99.999%    5.25 minutes
    Total                                           167 hours
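    The Figure 4 total can be reproduced with a few lines of arithmetic. The sketch below computes it two ways: by summing each layer's downtime, which matches the figure's total of about 167 hours, and by multiplying the layer availabilities, which gives a slightly smaller number. The second method is not in the paper; it assumes that every layer is required and that layers fail independently. Either way the environment as a whole delivers roughly 98.1 percent uptime, far below its best component. Per-layer values computed here may differ from the figure in the last digit because of rounding.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year

# Uptime percentages per layer, as listed in Figure 4.
layers = {
    "All LAN connections": 99.9,
    "Presentation server": 99.2,
    "Application server": 99.7,
    "Database server": 99.995,
    "Database application software": 99.3,
    "Data subsystem for database only": 99.999,
}


def downtime_hours(uptime_percent):
    """Expected downtime per year, in hours, for one layer."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


# Method 1: add up each layer's downtime (the approach Figure 4 appears to use).
summed_downtime = sum(downtime_hours(pct) for pct in layers.values())

# Method 2 (assumption): all layers are needed and fail independently,
# so the combined availability is the product of the layer availabilities.
combined_availability = 1.0
for pct in layers.values():
    combined_availability *= pct / 100.0
combined_downtime = (1.0 - combined_availability) * HOURS_PER_YEAR

print(f"Summed downtime:   {summed_downtime:.1f} hours per year")
print(f"Combined uptime:   {combined_availability * 100:.2f}%")
print(f"Combined downtime: {combined_downtime:.1f} hours per year")
```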

    Supporting the four pillars of high availability

    When you look at all the potential causes of downtime and the extraordinary costs that those outages can bring, you realize that successfully achieving your availability objectives is critical and complex. Standard high availability software packages only begin the process of addressing complete functional availability needs.

    The technology of the hardware and software is simply not enough. We must also address other areas that affect availability such as:

    An adequate and well trained staff

    Change management policies and problem determination policies that are detailed and specific. These policies must be known and respected by all the staff

    Adequate environment monitoring tools

    Successful backup/recovery and disaster recovery tools and plans

    If any one of these areas is not adequately addressed, the availability of the application environment will be in jeopardy.

    We call these different areas of application environment support "pillars", and categorize them into four groups: infrastructure, business contingency, support services, and operations.

    We take the approach that availability objectives are achieved with a combination of hardware and software technology brought together by a philosophy of availability. Simply put, the philosophy is as follows:

    High availability of an application environment is achieved when all pillars of that environment are adequately supported.

    In some capacity every application environment contains the four components we list as pillars. However, it is a matter of opinion as to which items are placed in which pillar. Often the items within each pillar address a multitude of application environments. What is crucial is that all items that affect availability are placed in pillars for examination. Understanding these pillars and shoring up any weaknesses provides a solid foundation for addressing availability effectiveness.

    The infrastructure pillar

    We divide the infrastructure pillar into three parts. The first area focuses on the hardware and software associated with an application environment. These are largely the servers, the operating systems, databases, specific availability software solutions and other relevant applications.

    The second major area of the infrastructure pillar is the shared storage infrastructure. While technically another component of the hardware solution, the shared storage really stands alone as a crucial piece of the overall availability solution. This area of the pillar requires focus on the storage hardware technologies such as enterprise storage arrays and their associated data management software. These tools can move data from one storage device to another or provide for real-time mirror copies, both in local as well as remote locations. Also important to the storage infrastructure is networked storage such as the storage area network (SAN), the network attached storage (NAS), and the IP storage solutions.

    Figure 5: pillars of high availability (diagram labels: Business Application; Infrastructure, Business Continuity, Support Services, Operations)

    The last part of the infrastructure pillar is the physical environment in which the solution is housed. This refers to environmental conditions such as raised floor space, proper cooling, independent power circuits and placement of servers in racks and on floors.

    The business contingency pillar

    The business contingency pillar is designed to focus attention on the technology of restarting business once it has been interrupted. This restart normally comes in the form of a manual effort such as restoring data in a local recovery solution, or restoring a production operating environment in the form of business continuance.

    This pillar covers two major areas: local backup and recovery solutions, and business continuance solutions. The local backup and recovery solution has an obvious influence on application availability, as nearly all applications depend heavily on their data. Focus here is on the use of the technology and the retention policies of the data.

    The business continuance, or disaster recovery, solution is also closely coupled with the availability of application environments. The focus here is on the use of technology, information from reports such as a business impact analysis, and execution of disaster recovery testing.

    The support services pillar

    The support services pillar focuses primarily on two areas: security and networks. It is clear that lack of sufficient security can negatively affect application availability. This area should focus on the use of firewalls and intrusion detection software. In addition, policies concerning password management and server access must be closely examined.

    Networks and connectivity are also critical to application availability. Areas such as redundancy in the network architecture and throughput analysis should be investigated to understand their influence on the ability to execute a business process. Another key piece of network support services is the ability to quickly diagnose and repair network-related issues. Critical to this success are detailed diagrams of all network segments; these should be regularly updated and distributed to support teams.

    The operations pillar

    The final pillar affecting availability is the operations pillar. This pillar covers all areas pertaining to the routine day-to-day management of the application environment. These areas include system administration, problem management, change management, 24x7 monitoring of the environments and compliance with business service level agreements.

    Achieving functional high availability

    Once the pillars of availability are defined in a given application environment, careful consideration must be given to all areas that can negatively affect availability. These areas should be investigated to determine if they are sufficient to meet the application environment's availability objectives. If specific areas leave holes in the availability umbrella, action can be taken to correct the shortcomings.

    Perform an assessment of availability effectiveness

    The best way to ensure availability objectives are being met is to perform an availability effectiveness assessment of the application environment. This investigation should be conducted through a series of server and environment interrogations and interviews with key staff. The investigation should study each of the four pillars in three different dimensions: tools, staff, and procedures (see Figure 6).

    In general the tools of a pillar refer to the hardware and software components installed to meet specified technology needs. We must discover whether tools exist to support this pillar, whether they are used or known, and whether the current tool is effective in supporting the pillar. A critical tool to investigate is the presence of customized diagrams and documentation that clearly depict the application environment. These can be server, network and storage configurations.

    Figure 6: supporting the pillars of high availability (diagram labels: Tools, Staff, Procedures; High Availability; Infrastructure, Business Continuity, Support Services, Operations)

    The staff associated with a pillar are the employees, managers and consultants needed to support that pillar. This staff must be adequately trained and adequate in number. For example, it is never a good idea to have only a single person who is capable of providing system administration duties for critical servers. The staff must be well supported and represented by management, and they must have adequate training in technology supporting future IT initiatives. Overall, to support functional high availability, critical staff should be self-sufficient. Contracted remote monitoring services are beneficial for supplementing and aiding critical staff, but avoid dependence upon outside groups and contractors for critical functions.

    The procedures associated with the support of a pillar should focus on how and why technology is used to meet availability objectives. Most importantly, they must be documented and known to all. Far too often we allow a few knowledgeable people to hold critical information in their heads without asking them to write it down. These procedures should be clear enough that anyone can follow them. All staff should be educated on, and instructed to follow, the procedures. Lastly, all procedures should be regularly reviewed and updated.

    The availability effectiveness assessment should provide feedback in two fundamental ways. First, there should be a quantitative analysis of the findings. This can be nothing more than a score for each dimension of each pillar. A more detailed report would record a numerical response to a constant set of questions. These scores can then be weighted by their importance in an application environment. Each score should reflect one of three fundamental categories affecting availability: (1) an item does not exist, (2) an item exists but is insufficient, or (3) an item exists and is sufficient. For example, a standard question to ask in an availability effectiveness assessment is whether the company has a change management policy to govern its critical servers. If you ask that question of different IT staff you may get different answers. The manager who authored the policy may reply that the policy exists and is sufficient (score of 3). However, the half dozen people who would regularly use the policy may state that it does not even exist (score of 1).

    This quantitative score can then be compared to a perfect score, or if similar questions are asked, the score of another application environment. A quick overview of such scores can show whether deficiencies exist in tools, staff, or procedures within a particular pillar.
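    One possible way to tabulate such scores is sketched below. The responses, respondent labels and field names are all hypothetical. For each pillar and dimension the sketch reports the average score, its percentage of a perfect score of 3, and the spread between the highest and lowest answer; a large spread marks a question where respondents disagree, as in the change management example above.

```python
PERFECT_SCORE = 3  # 1 = does not exist, 2 = exists but insufficient, 3 = sufficient

# Hypothetical interview responses: (pillar, dimension, respondent, score).
responses = [
    ("operations", "procedures", "manager", 3),
    ("operations", "procedures", "administrator A", 1),
    ("operations", "procedures", "administrator B", 1),
    ("infrastructure", "tools", "manager", 3),
    ("infrastructure", "tools", "administrator A", 3),
]


def summarize(rows):
    """Group scores by (pillar, dimension) and compute simple statistics."""
    grouped = {}
    for pillar, dimension, _respondent, score in rows:
        grouped.setdefault((pillar, dimension), []).append(score)

    summary = {}
    for key, scores in grouped.items():
        average = sum(scores) / len(scores)
        summary[key] = {
            "average": average,
            "percent_of_perfect": average / PERFECT_SCORE * 100,
            "spread": max(scores) - min(scores),  # disagreement between respondents
        }
    return summary


summary = summarize(responses)
for (pillar, dimension), stats in summary.items():
    print(
        f"{pillar}/{dimension}: average {stats['average']:.2f}, "
        f"{stats['percent_of_perfect']:.0f}% of perfect, spread {stats['spread']}"
    )
```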

    The second crucial method for providing feedback should be a qualitative approach. This report should be drafted by the person evaluating the availability effectiveness based on interviews. It should report on responses to prepared questions, especially when those responses differ from person to person. For example, an employee may report that a backup and recovery tool exists, but is totally worthless. On the other hand, a manager who is asked the same question might reply that the backup and recovery tool exists and completely meets their needs.

    In a qualitative evaluation it is also important to note what is not said. If interviewees consistently avoid discussing a peer's ability to manage the systems, further investigation may be required to determine whether that person is competent.

    Finally, the findings from an availability effectiveness assessment should be communicated to those people responsible for the availability of an application environment. This is often the Chief Information Officer, or IT director. This can be done by creation of a formal report or presentation. It is important that these results are documented so that follow-up investigations can be performed to see if these problems still exist or if they have been alleviated.

    Focus on shoring up the weaknesses found in each pillar

    Based on the reported findings of an availability effectiveness assessment, plans can be put in place to address the exposed shortcomings. These rollout plans should become part of the overall high availability implementation project. They should be documented, tested and implemented in conjunction with the chosen high availability technology.

    Since the hardware and software technology associated with high availability is fairly well understood, the changes required to improve availability effectiveness often do not require the capital acquisition of technology. Rather, they require proper creation and management of policies and procedures. It is often most effective if these policies are created and managed internally, as full-time employees tend to have the best insights into what will be fruitful solutions.

    In general, it is most effective to first address the issues that are most important to an application environment. For example, having clustering software installed and running without a trained staff to support it can often affect system availability more negatively than not having clustering software at all.

    Conclusion

    In nearly all aspects of today's business world the availability of the underlying IT infrastructure is crucial. Even being off-line for a short time can have a tremendous effect on a company's health and economic viability. But preventing these outages cannot be addressed simply by a technology solution. People, policy and procedures can have a far more significant impact on availability. Functional high availability can only be achieved through an effort to investigate all areas that can stifle the business transaction.

    If any company is considering deploying high availability solutions, it is also critical that they consider availability from the user's perspective. Specifically, an investigation should be performed to understand how effective a solution would be in terms of user interaction and satisfaction.

    CNT offers highly trained professionals to perform availability effectiveness assessments and design solutions that provide the highest levels of functional availability. These assessments can be structured to pinpoint holes in existing high availability architectures as well. This gap between the business needs for IT availability and an organization's ability to meet those needs must be bridged. CNT follows up each assessment with a comprehensive plan to implement solutions that will help your company meet its functional high availability needs.

    CNT is one of the world's largest providers of comprehensive storage networking solutions. For over 20 years, our experts have analyzed, designed, and built enterprise storage networks.

    Visit www.cnt.com to learn about our solutions, products, partnerships, career opportunities, and more.

    © 2003 by Computer Network Technology Corporation (Nasdaq: CMNT). All rights reserved. Any reproduction of these materials without the prior written consent of CNT is strictly prohibited. CNT, the CNT logo, Channelink, and UltraNet are registered trademarks of Computer Network Technology Corporation. All other trademarks identified herein are the property of their respective owners. CNT is an equal opportunity employer. CNT corporate headquarters QMS is registered to ISO 9001:2000. Certificate #006765.

    USA: 1-800-638-8324; Canada: 905-595-1500; UK: 44-1753-792400; France: 33-1-4130-1212; Australia: 61-2-9540-5486; Germany: 49-89-42 74 11-0; Switzerland: 41-1-73 35-733; Belgium: 32-2-737 76 42; Italy: 39-06-514931; Brazil: 55-11-5509-1504; Japan: 813-5403-4858; Other locations: 1-763-268-600

    PL581 | 0803