Disaster Recovery Solutions With Virtual Infrastructure: Implementation and Best Practices

Govindarajan Soundararajan, VMware
Siva Kodiganti, VMware
Lokesh Krishnarajpet, VMware

Disaster Recovery Sessions (VMWorld 2005)

SLN104-A: Implementation and Solutions
SLN104-B: Backup and Recovery
SLN104-C: Panel Discussion

Other sessions
LAB003: Backup and Disaster Recovery

Agenda

Introduction
Disaster Recovery (DR)
  DR process/methods
ESX DR
  Backup and restore (virtual machines, COS, ESX Server)
  High availability, SAN, multipathing, and clustering

Quiz

If your data center is impacted by a disaster, which data (or servers) would you recover first?

Payroll / intranet / support / external website / online store

(This slide was taken from a disaster recovery presentation given by a university professor.)

Introduction

A real world example of a disaster impact on a business unit
Types of disasters
Why DRB?
Disaster mitigation
Backup and recovery
High availability (SAN, multipathing) and clustering
DRB process

A Real World Example of a Disaster Impact on a Business Unit

(This is a general example and is not related to VMware in any way)

A popular auction website experienced about 20 hours of downtime, causing a $3 million to $5 million decrease in its quarterly revenue.

The reason: the most recent critical patch had not been applied.

Types of Disasters

Natural
  Tornado, hurricane, flood, earthquake, fire
Man-made
  Accidents, operator error, theft, virus attacks, etc.
System and facility failures
  Hardware failures
  Software failures
  Power, A/C failures

Why DRB?

Anticipate problems
  Hardware and software fail
  Disasters will happen
  Accidents happen
  People make mistakes
Be proactive
  Minimize disruptions
  Stay in business

Disaster Mitigation

Disaster mitigation consists of preventive measures that protect against disaster and minimize its impact should one occur. It should include:

Monitoring web sites about your operating systems, applications and hardware
Keeping your OS and application patches up to date
Keeping your BIOS and firmware up to date
Monitoring security web sites
Establishing a DR plan and implementing it

Backup and Recovery Planning

Preparing for:
  Day-to-day mishaps
  Subsequent disaster recovery plans
Ensure:
  Servers are backed up on schedule
  Recovery works on both source and target hardware
  Backup and recovery process is documented

High Availability

Redundant features in hardware and software systems to facilitate business continuity in the face of disaster

A backup node will take over should the primary node fail

Implementations
  Clustering
  Fault-tolerant components
  Shared storage

DRB Process

Plan
  Identify assumptions
  Determine Service Level Agreements (SLAs) and critical servers with priorities
  Compare price of disaster recovery setup vs. price of disaster
Implement
  Set up hardware and software
  Configure servers, applications and scripts
Test
  Test each component for failure and fault tolerance
  As the data environment changes, review and revise the plan, then implement and test again
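The planning step "compare price of disaster recovery setup vs. price of disaster" can be made concrete with a little arithmetic. The sketch below is only an illustration: the revenue loss and downtime come from the auction website example earlier in this deck, while the outage frequency, the reduced outage length and the DR budget are hypothetical numbers chosen for the example.

```python
def hourly_downtime_cost(revenue_lost, hours_down):
    """Average revenue lost per hour of downtime."""
    return revenue_lost / hours_down


def expected_annual_loss(hourly_cost, outage_hours, outages_per_year):
    """Expected yearly loss = cost per hour * hours per outage * outages per year."""
    return hourly_cost * outage_hours * outages_per_year


def dr_pays_off(annual_dr_cost, loss_without_dr, loss_with_dr):
    """True when the loss avoided by the DR setup exceeds what the setup costs."""
    return (loss_without_dr - loss_with_dr) > annual_dr_cost


# From the auction example: $3M to $5M lost over about 20 hours.
low = hourly_downtime_cost(3_000_000, 20)    # 150000.0 per hour
high = hourly_downtime_cost(5_000_000, 20)   # 250000.0 per hour

# Hypothetical assumptions: one such outage every ten years, and a DR
# setup that shortens the outage from 20 hours to 2 hours.
without_dr = expected_annual_loss(low, outage_hours=20, outages_per_year=0.1)
with_dr = expected_annual_loss(low, outage_hours=2, outages_per_year=0.1)

# Hypothetical DR budget of $200,000 per year.
print(dr_pays_off(200_000, without_dr, with_dr))
```

With real SLAs, the same comparison would be repeated per server, in the priority order decided during planning.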

ESX Server Disaster Recovery

Backup and restore and best practices
VMware backup and restore tools and best practices
High availability
Clustering

Backups

Types of backups

Incremental
  Backup of files changed since the last backup
  Best for application data
Differential
  Backup of files changed since the last full backup
  Alternative to incremental, typically requiring fewer tapes for a restore
Full backup
  Backup of all files and directories on a system
  Best for full recovery after a system failure
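The difference between the three backup types comes down to which cutoff time a file's modification time is compared against. The following is a minimal sketch of that selection rule, not the logic of any particular backup product; `FileInfo`, `select_files` and the timestamps are hypothetical names and values used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    mtime: float  # last modification time, e.g. seconds since the epoch


def select_files(files, kind, last_full, last_backup):
    """Return the paths that a backup of the given kind would copy.

    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any kind
    """
    if kind == "full":
        return [f.path for f in files]          # everything, regardless of age
    if kind == "differential":
        cutoff = last_full                      # changed since last FULL backup
    elif kind == "incremental":
        cutoff = last_backup                    # changed since last backup of ANY kind
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return [f.path for f in files if f.mtime > cutoff]


files = [FileInfo("a.doc", 5), FileInfo("b.doc", 15), FileInfo("c.doc", 25)]
# Full backup ran at t=10, an incremental ran at t=20.
print(select_files(files, "full", last_full=10, last_backup=20))
print(select_files(files, "differential", last_full=10, last_backup=20))
print(select_files(files, "incremental", last_full=10, last_backup=20))
```

A restore from differentials needs the last full backup plus the latest differential, whereas a restore from incrementals needs the last full backup plus every incremental taken since.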

Backups

Hot backup
  Machine is running at time of backup
  Examples:
    Guest OS backup agents
    vmsnap scripting
Cold backup
  Machine is powered off or suspended at time of backup
  Examples:
    Manual virtual machine cloning (copy config, nvram, and disk files)
    VMware VirtualCenter cloning

Backup Strategies

From Host/Service Console
  Best for system image
  May need service console disk space for exported disks
  Easy to boot a restored virtual disk
  Virtual machine power-off required unless using redo logs
  Full disk image only, including system files

From Guest OS
  Best for application data
  Must install agent software in each virtual machine
  Archived system data can only be restored into a running virtual machine
  Application data can be archived without shutdown
  Allows incremental, differential backup

Backup Best Practices

Guest OS backups for daily recovery needs such as file or directory restores
Virtual machine snapshots (disk images + config files) through the service console to remote tape for quick bare metal restore of a virtual machine
ESX Server attached tape device for service console backup only (/boot, /, /home and optionally /vmimages)
Non-ESX Server attached tape devices for guest OS backups, virtual machine snapshots and/or .dsk backups from the service console
Archive backup tapes to offsite storage
Perform test restores to validate backups
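One way to carry out "perform test restores to validate backups" is to restore into a scratch directory and compare checksums against the source. This is a minimal sketch under that assumption; `sha256_of` and `verify_restore` are hypothetical helpers, and it does not drive any specific backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in restored_dir."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        # A file fails validation if it was not restored or its content changed.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(str(rel))
    return problems
```

An empty result means every source file came back byte for byte. For a running virtual machine the source keeps changing, so a comparison like this is only meaningful against a cold copy or a checksum recorded at backup time.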

ESX Server Best Practices

Set up service console disks on an internal RAID-1 (two mirrored disks)
Set up VMFS on SAN storage with redundant SPs, HBAs and FC switches (alternative: RAID-5)
Back up your service console non-VMFS disks with standard backup agents
Treat the service console as a distinguished virtual machine
Have your backup server on a dedicated piece of hardware

ESX Server Recovery

The DR plan should ensure these key items are available at the recovery location:

ESX Server disk
License keys
ESX Server configuration and customization
Printout of virtual machine configurations
Printout of other configurations
Backup media of images
  .vmdk or .dsk, .redo and .redo.redo files
  .vmss files
Installation media for recovery software

Backing up ESX Server

Custom configurations
  Monitoring tools
  Custom settings and additional software
User and system information
  User and group information in /etc
  System configuration information in /etc
ESX Server and configuration files
  Boot partition
  /etc/vmware directory
  Virtual machine configuration files and nvram files
  Extras: vmkusage, reports, esxtop

VMware Backup Tools

vmsnap.pl (ESX Server 2.x and higher)

Sets up redo logs and creates a hot backup by exporting each .dsk disk file to a set of .vmdk disk files. The resulting .vmdk disks and related configuration files are automatically sent to an archival server, or to a hot standby server where the virtual machine can be restored. The script proceeds in this order:

Add a redo log to the .dsk file. If there is more than one .dsk file, do this for each one
Export the .dsk file to .vmdk file format
Transfer the .vmdk file(s), virtual machine configuration file, nvram file and log files to an archive server or to a local storage location
Add a second redo log
Commit the first redo log, then commit the second redo log
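The order of the redo log steps matters: while the first redo log absorbs writes, the base disk stays unchanged and can be exported consistently, and the second redo log lets the first be committed while the virtual machine keeps running. The sketch below is a toy in-memory model of that layering, written only to illustrate the sequence. It is not code from vmsnap.pl, and `VirtualDisk` and its methods are hypothetical.

```python
class VirtualDisk:
    """Toy model: a base disk plus a stack of redo logs (oldest first)."""

    def __init__(self, blocks):
        self.base = dict(blocks)
        self.redo_logs = []

    def add_redo(self):
        # New writes are redirected to the newest redo log.
        self.redo_logs.append({})

    def write(self, block, data):
        target = self.redo_logs[-1] if self.redo_logs else self.base
        target[block] = data

    def read(self, block):
        # The newest layer containing the block wins.
        for layer in reversed(self.redo_logs):
            if block in layer:
                return layer[block]
        return self.base[block]

    def export_base(self):
        # Copy of the base disk only; redo logs are not included.
        return dict(self.base)

    def commit_oldest(self):
        # Fold the oldest redo log into the base disk.
        self.base.update(self.redo_logs.pop(0))


disk = VirtualDisk({0: "a", 1: "b"})
disk.add_redo()                 # add first redo log
disk.write(0, "a2")             # guest keeps writing during the backup
exported = disk.export_base()   # export sees the frozen base
disk.add_redo()                 # add second redo log
disk.write(1, "b2")             # writes now land in the second log
disk.commit_oldest()            # commit first redo log
disk.commit_oldest()            # commit second redo log
print(exported, disk.base, disk.redo_logs)
```

In this model the exported image reflects the disk as it was when the first redo log was added, and after both commits the base disk holds every write with no redo logs left over.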


Do not rename .vmdk files. Because the conversion of one .dsk file can result in multiple .vmdk files, the first .vmdk file stores the names of the additional .vmdk files
.vmdk files created on ESX Server 2.0 systems cannot be restored on ESX Server 1.5.2 systems


Creating the redo files does not affect the performance of the running virtual machine, and neither does the first commit. The second commit pauses the virtual machine briefly, typically for less than 1 second
The virtual machine disks should be shrunk before using vmsnap, to minimize the total size of the resulting .vmdk files. Currently this must be done manually from within the virtual machine using VMware Tools


If the script is aborted, a REDO and/or a REDO.REDO log may remain for a particular disk. If the '-r' option of vmsnap does not perform the commit correctly, you may need to restart the guest virtual machine and accept the commit requests presented in a pop-up window. This ensures that no redo logs remain when the machine comes up again

VMware Restore Tools

vmres

The vmres script is used to restore a virtual machine snapshot created with the vmsnap script
The script retrieves all snapshot files of the original virtual machine from an archive server. This includes vmx, nvram, log and vmdk files
Configuration files are copied into the specified owner's vmware configuration directory
The vmdk files are converted to dsk files and stored in the specified VMFS directory
The virtual machine configuration file is updated, permissions are set and the virtual machine is registered

Backup Best Practices

Back up the guest OS using your current backup software and scheduling, to allow for file-level and directory-level restores
Take virtual machine snapshots of the guest OS using API scripting for "hot backups", to speed recovery of a virtual machine in the event of a disaster. Once the virtual machine has been set up from the snapshot, guest OS backups can be used to fully recover the data that changed on the system after the snapshot was taken

ESX Server Repair

In the event of MBR failure on your ESX Server, the following method can be used to repair the system:

Boot off Red Hat 7.2 CD #1
When the LILO prompt comes up, enter:
  linux rescue
Proceed through the screens
When you get a shell prompt, enter:
  chroot /mnt/sysimage /sbin/lilo -c -v
The system will reboot
This will recover your MBR

High Availability: SAN

Advantages of SANs

Ease of moving virtual machines between ESX Servers
Performance
  High availability and redundancy
  High performance backups and restores and use of snapshots
  Secure and robust data transfers
  Dual-redundant SAN fabrics, redundant HBAs and storage ports
  Metropolitan Area Network (MAN) and Wide Area Network (WAN) support
Management
  Data replication
  Disaster recovery
  Dynamic expansion (easily scalable: ESX Server 2.0 support for disk extents)

SAN: Multipathing

Multipathing
  Maintain a constant connection between server and storage in case of a failure of an HBA, switch, or storage controller
  mru and fixed policies for path failover
  Verify status of available paths using:
    vmkmultipath -q
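The two failover policies differ mainly in what happens after a failed path comes back: fixed returns to the preferred path, while mru (most recently used) stays on whichever path it last switched to. The sketch below models only that behavior, as generally described for these policies. It is not ESX Server's internal algorithm, and `select_path` and the path names are hypothetical.

```python
def select_path(policy, path_up, current, preferred=None):
    """Pick the path to use for I/O.

    policy    -- "fixed" or "mru"
    path_up   -- dict mapping path name to True (usable) or False (failed)
    current   -- path in use before this call, or None
    preferred -- preferred path (only consulted by the fixed policy)
    """
    if policy not in ("fixed", "mru"):
        raise ValueError(f"unknown policy: {policy}")
    # fixed: always use the preferred path whenever it is available.
    if policy == "fixed" and path_up.get(preferred, False):
        return preferred
    # Otherwise keep using the current path as long as it still works.
    if current is not None and path_up.get(current, False):
        return current
    # Current path failed: fail over to any surviving path.
    for name in sorted(path_up):
        if path_up[name]:
            return name
    return None  # no path left


A, B = "vmhba0:0:1", "vmhba1:0:1"

# Path A fails: both policies fail over to B.
failed = {A: False, B: True}
print(select_path("fixed", failed, current=A, preferred=A))
print(select_path("mru", failed, current=A))

# Path A recovers: fixed fails back to A, mru stays on B.
recovered = {A: True, B: True}
print(select_path("fixed", recovered, current=B, preferred=A))
print(select_path("mru", recovered, current=B))
```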

ESX Server and SAN Best Practices

Buy SAN equipment that is on the supported hardware list
Understand your SAN configuration and equipment operation prior to connecting an ESX Server to it. Make sure your ESX Server administrators and your SAN administrators communicate
Test with non-production virtual machines first to validate that the configuration is correct
Ensure that the cabling order is correct on all fabric switches
Space reboots of ESX Servers accessing the same LUNs at least 10 minutes apart to ensure correct setting of active paths to preferred paths

ESX Server and Clustering

Clustering Support

Clustering is done via clustering software, and is not inherent in VMware ESX Server
Applications running in virtual machines need to be cluster-aware
Requires VMFS virtual disks to be on a shared volume
Refer to the VMware ESX Server Administration Guide for clustering examples

ESX Server and Clustering

How clustering works:

Application fails
Backup application identifies the failure and takes over
Clustering virtual machines on the same physical server protects against application failures
Cluster across different physical servers for added redundancy

Use of clustering services in virtual machines provides high availability with less hardware (such as machines and network adapters)
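The "backup application identifies the failure and takes over" step is commonly built on a heartbeat with a timeout. The following is a generic sketch of that idea, not the mechanism of any specific clustering product; `StandbyNode` and the timeout value are hypothetical.

```python
class StandbyNode:
    """Standby that takes over when the primary's heartbeat goes silent."""

    def __init__(self, timeout):
        self.timeout = timeout        # seconds of silence tolerated
        self.last_heartbeat = None    # time the last heartbeat arrived
        self.active = False           # True once this node has taken over

    def on_heartbeat(self, now):
        # Called whenever a heartbeat from the primary is received.
        self.last_heartbeat = now

    def check(self, now):
        # Take over if the primary has been silent for longer than the timeout.
        silent_too_long = (
            self.last_heartbeat is not None
            and now - self.last_heartbeat > self.timeout
        )
        if not self.active and silent_too_long:
            self.active = True
        return self.active


standby = StandbyNode(timeout=5)
standby.on_heartbeat(now=0)
print(standby.check(now=3))    # primary still considered alive
standby.on_heartbeat(now=4)
print(standby.check(now=10))   # 6 seconds of silence: standby takes over
```

The timeout is a trade-off: too short and a delayed heartbeat triggers an unnecessary failover, too long and the outage lasts longer than it needs to.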

Clustering Scenarios in ESX Server

Cluster in a box
  Multiple virtual machines on a single physical machine
  To deal with software crashes or administrative errors
  Supports heartbeat network without any extra network adapters

Clustering Scenarios in ESX Server

Cluster across boxes
  Consists of virtual machines on multiple physical machines
  Can deal with the crash of a physical machine
  Placed on a SAN with shared LUN storage (disks are 'shared')
  Heartbeat across VMNIC interfaces
  Shared storage required (SAN)

Clustering Scenarios in ESX Server

Cost-effective standby host
  Provides a standby host for multiple physical machines on one standby box with multiple virtual machines
  Physical machine clustered with a virtual machine
  Physical machine is primary, virtual machine is backup
  Heartbeat across VMNIC interfaces
  Shared storage required (SAN)

Clustering Best Practices

Do not use undoable mode for disks used in clusters
Applications must be stateless or have failover support
ESX Server supports SCSI Level 2 reservations, which are used by Microsoft Cluster Services and Veritas Cluster Services
Use shared mode for VMFS when setting up clustering configurations. This allows controlled simultaneous access to files
Have separate LUNs for data and quorum
Use a separate LUN for each cluster
Clustering between virtual machines across ESX Servers
  Specify scsi1.sharedBus = "physical" in each virtual machine's configuration file
  Use physical names for shared VMFSes rather than friendly names:
    scsi1:1.name = "vmhba0:1:0:1:quorum.dsk"
    scsi1:2.name = "vmhba0:1:0:1:shared.dsk"
  Designate the VMFS as shared using the MUI or:
    vmkfstools -F shared vmhba0:1:0:1
  Public net should always be on a 'bond', not a 'vmnic', for fault tolerance
  Heartbeat should never be on a 'bond' for timing reasons (prevents full failovers). This is sometimes called 'split braining the cluster'
  See VMware white papers on clustering or attend the ESX Server: Advanced System Management course for training

VMware Documents

VMware Backup Software Guide ESX Server
VMware Backup Software Compatibility Guide
VMware ESX Server Backup Planning Technical Note
VMware ESX Server and Clustering White Paper
VMware Backup Architecture
VMware Backup Planning
VMware Disaster Recovery
Using Veritas Backup Exec 9
ESX Server 2 and Storage Area Networks
ESX Server Administration Guide

Web Sites

VMware Web Sites
  http://www.vmware.com/pdf/disaster_recovery.pdf (disaster recovery)
  http://www.vmware.com/pdf/ESXBackup.pdf (white paper on backup)
  http://www.vmware.com/customer/stories/jfs.html (a customer setup)

Other Web Sites
  http://www.disasterrecoveryworld.com/
  Numerous other web sites can be found by searching www.google.com with "disaster recovery" or "business continuity" as keywords