Metro Cluster High Availability or SRM Disaster Recovery?

© 2014 VMware Inc. All rights reserved.


David Pasek, VMware PSO, TAM, VCDX #200

Stanislav Jurena, VMware PSO, TAM, VCAP-DCD/DCA

Demystifying myths

VMUG Prague, 2016 Dec 6


Agenda

1 Business Continuity

2 High Availability

3 Disaster Recovery

4 Disaster Avoidance

5 Multisite High Availability or Disaster Recovery?

6 Q & A


Business Continuity

Business Continuity - Definition

• Business continuity encompasses planning and preparation to ensure that an organization can continue to operate in case of serious incidents or disasters and is able to recover to an operational state within a reasonably short period. As such, business continuity includes three key elements:
  – Resilience (High Availability): critical business functions and the supporting infrastructure must be designed in such a way that they are materially unaffected by relevant disruptions, for example through the use of redundancy and spare capacity.
  – Recovery (Disaster Recovery): arrangements have to be made to recover or restore critical and less critical business functions that fail for some reason.
  – Contingency: the organization establishes a generalized capability and readiness to cope effectively with whatever major incidents and disasters occur, including those that were not, and perhaps could not have been, foreseen. Contingency preparations constitute a last-resort response if resilience and recovery arrangements should prove inadequate in practice.

Source: https://en.wikipedia.org/wiki/Business_continuity

WHAT IS NOT MENTIONED IN WIKIPEDIA
  – Mitigation (Disaster Avoidance): the organization can improve contingency planning with mitigation planning. Do something proactively to avoid unexpected disasters.


https://en.wikipedia.org/wiki/Resilience_(organizational)

https://en.wikipedia.org/wiki/Data_recovery

https://en.wikipedia.org/wiki/Contingency_management


Business Continuity - Terminology

• General concepts and terminology
  – Business Continuity – must be based on BIA (Business Impact Analysis)
    • RPO (Recovery Point Objective), RTO (Recovery Time Objective) – Infrastructure level
    • WRT (Work Recovery Time) – Application level
    • MTD (Maximum Tolerable Downtime) = RTO + WRT – Business level
  – High Availability, Disaster Recovery, Disaster Avoidance
  – Availability Zones, Regions
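The relation MTD = RTO + WRT can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the example durations are hypothetical, not taken from the slides.

```python
from datetime import timedelta


def maximum_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD = RTO + WRT, the relation given in the terminology list."""
    return rto + wrt


def meets_business_target(rto: timedelta, wrt: timedelta, mtd_target: timedelta) -> bool:
    """True when infrastructure recovery plus application recovery fits the MTD set by the BIA."""
    return maximum_tolerable_downtime(rto, wrt) <= mtd_target


# Hypothetical figures: 1 h to restart the infrastructure, 3 h to get the application working again.
rto = timedelta(hours=1)
wrt = timedelta(hours=3)
print(maximum_tolerable_downtime(rto, wrt))  # 4:00:00
```

A low RTO alone does not guarantee that the business target is met: if WRT is long or unpredictable, MTD is too.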


Less than ~60 km

More than ~60 km

High Availability, Disaster Recovery, Data Protection


Business Continuity / High Availability


Business Continuity / High Availability

• High Availability technologies
  – Self-initiated failover without human intervention
  – Master node or software arbiter is required
  – For a multisite HA solution, a third site for the arbiter is required
• VMware HA Cluster Solutions
  – Local vSphere High Availability Cluster (vSphere HA)
  – Multisite vSphere Metro Storage Cluster (vMSC)
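Why a multisite HA solution needs an arbiter in a third site can be illustrated with a generic majority-vote model. This is a sketch under stated assumptions, not the actual vSphere HA master election or any storage vendor's witness protocol. The voter names and the `has_quorum` helper are hypothetical.

```python
def has_quorum(reachable: set[str], voters: set[str]) -> bool:
    """A partition may keep running only if it can reach a strict majority of voters."""
    return 2 * len(reachable & voters) > len(voters)


# Two sites plus a witness in a third site (hypothetical names).
with_witness = {"site_a", "site_b", "witness"}
# Two sites only, no arbiter.
without_witness = {"site_a", "site_b"}

# The inter-site link fails; site A can still reach the witness, site B cannot.
print(has_quorum({"site_a", "witness"}, with_witness))  # True
print(has_quorum({"site_b"}, with_witness))             # False

# Without a witness, neither side holds a majority after the split.
print(has_quorum({"site_a"}, without_witness))          # False
print(has_quorum({"site_b"}, without_witness))          # False
```

With only two voters, a link failure leaves both sites unable to decide safely, which is the split-brain risk the third-site arbiter is meant to remove.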


Single site vSphere HA Cluster

Single Site Shared Storage: FC, iSCSI, NFS, VSAN

We all know that, right?

Local vSphere HA Cluster (single clustered system in single availability zone)

• Protection against
  • Physical server failure (ESXi Hosts monitoring)
  • OS failure on top of ESXi (Guest OS monitoring)
  • App failure on top of ESXi (App Monitoring)
• System Requirements
  • Shared local storage (Fibre Channel, SAS, iSCSI, NFS, SDS like VSAN)
  • Flat L2 Networks for VMs
• Software arbiter - Master node of HA Cluster


Multisite vSphere Metro Storage Cluster

Multisite Shared Storage: FC, iSCSI, NFS, VSAN

Not so common in the field, but a very popular topic.

Multisite vSphere Metro Storage Cluster (single clustered system over two availability zones)

• Protection against
  • Various Storage Array Failures
  • Whole Single Site Storage Array Failure
  • Complete Site Failure
  • Anticipated disaster (Disaster Avoidance)
• System Requirements
  • Shared stretched storage Volumes / LUNs distributed across two storage arrays and visible/mounted to ESXi
  • Third availability zone required to host the arbiter
  • Flat L2 Networks for VMs

Distributed LUN across two storage systems (Storage System A and Storage System B), with a Storage Witness

Business Continuity / High Availability - HA2

Metro Storage Cluster (vMSC)

– Advantages
  • Positive impact on RTO during single storage or site failure
    – Faster disaster recovery because VMs are automatically restarted without human interaction
  • Higher Protection (redundancy) against specific infrastructure failures
    – Protection against single storage array failure
    – Protection against complete site failure
  • Non-disruptive Disaster Avoidance
    – VM workloads vMotion between availability zones
    – VMs do not need to be restarted = higher VM availability SLA can be achieved
  • Operational Simplicity
    – Design, Implement, Test and Forget. Then pray that it will work when needed.
    – Schedule periodic tests to be sure it really works.
– Disadvantages
  • Single stretched fault zone
  • Complex clustering techniques highly dependent on the particular storage vendor
  • No test plan - it can be tested only by real failure simulation
    – Business critical application owners will not accept real failures.
  • App start order and dependency cannot be achieved = negative impact on WRT and MTD
  • Third site is required for software arbiter (arbiter, witness, tie-breaker)


Business Continuity / Disaster Recovery

Business Continuity / Disaster Recovery

• SRM - VMware DR technology = human-initiated failovers – human arbiter
  – Should be implemented between regions but can be implemented between availability zones as well
  – Only two regions are required because a human arbiter can run recovery from anywhere without split brain
  – Can be implemented for more regions – N : M
  – Independent Fault Zones - Data Replication and L3 network are the only common denominators among sites
  – Network connectivity should be L3 (routed) to mitigate fault propagation (broadcast storms, unknown unicast flooding, etc.)
  – All infrastructure services have to be duplicated in each region (NTP, DNS, Active Directory, vCenter, etc.)
  – DR orchestration = Application Dependencies (start order) can and should be specified
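The idea of specifying application dependencies as a start order can be sketched as a topological sort. This is a generic model using Python's standard `graphlib` module, not the SRM API or its recovery plan format. The VM names and the `start_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical application: each VM lists the VMs that must be running before it starts.
depends_on = {
    "db": set(),
    "app": {"db"},
    "web": {"app"},
    "monitoring": set(),
}


def start_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group VMs into waves; every VM in a wave depends only on VMs from earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(start_waves(depends_on))  # [['db', 'monitoring'], ['app'], ['web']]
```

VMs within one wave can start in parallel, while the waves themselves run in sequence. An HA restart without such ordering cannot guarantee this sequencing.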


Business Continuity / Disaster Recovery

• DR (VMware SRM)
  – Advantages
    • Positive impact on WRT
      – VMs restart in priority order and with application dependencies respected
      – RunBook (SRM Recovery Plan)
    • Independence from failures in the other region
    • Mitigation of false positive failures and unnecessary failovers
      – Human initiation of DR failover – business approval required
    • DR tests without impact on production
      – Detailed report of performed DR tests
  – Disadvantages
    • Higher RTO
      – Have to wait for human interaction (business approval before failover)
      – Storage replication has to be broken and volumes / LUNs have to be mounted to ESXi hosts on recovery sites
      – All VMs in a single recovery plan are started in parallel, but only 10 recovery plans can be executed concurrently
    • Operational and Business overhead
      – BIA must exist
      – Protection Groups and Recovery Plans have to be defined based on BIA
      – Recovery Plans have to be tested
      – Operational personnel have to be trained


Business Continuity / Disaster Avoidance


Business Continuity / Disaster Avoidance

• Disaster Avoidance is preventive failover to another availability zone to avoid anticipated disaster
• Failover with service disruption
  – Option 1: SRM fail-over
    • Two independent vCenters in two independent SSO domains
    • VMs graceful shutdown
    • VM re-start in correct order in another region / availability zone
• Failover without service disruption
  – Option 1: vSphere Metro Storage Cluster (vMSC)
    • Stretched LUN / datastore across availability zones (storage vendor specific technology)
    • VMware VM vMotion (CPU, RAM)
  – Option 2: vMotion without shared storage
    • VMware vMotion within single vCenter or across two vCenters in single SSO domain
    • VMware VM vMotion (CPU, RAM)
    • VMware Storage vMotion shared-nothing (vDisk)
  – Option 3: SRM cross vCenter vMotion without shared storage
    • Two independent vCenters in two different SSO domains
    • VMware VM vMotion (CPU, RAM)
    • VMware Storage vMotion shared-nothing (vDisk)


Multisite High Availability (Metro Cluster) or Disaster Recovery?

Infrastructure Design Qualities

• Availability <= High Availability
• Manageability
• Scalability
• Performance
• Security
• Recoverability <= Disaster Recovery
• Cost

Multisite HA (Metro Cluster) or Disaster Recovery?

• vSphere Metro Storage Cluster (vMSC) is a High Availability solution, great for
  – Protection against complete storage system failure
  – Non-disruptive Disaster Avoidance between availability zones
  – Protection against complete site failure with low RTO but unpredictable WRT and MTD
• but Metro HA (vMSC) is not real Disaster Recovery because of
  – Workload restart order unpredictability
  – Single system (fault zone) stretched across sites
  – Very hard to test
  – Shorter distance protection (< ~60 km)
• Real VMware Disaster Recovery solution is SRM
  – Predictable recovery plans
  – Testable recovery plans without impact on production
  – Longer distance protection (> ~60 km)
• So, what technology should I use?
  – It always depends on business requirements (BIA) and what you want to achieve
  – Stretched Metro HA Cluster (vMSC) for HA2 and Disaster Avoidance
  – SRM for Disaster Recovery
  – Both solutions can be used together – vSphere Metro Storage Cluster protected by SRM
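The rules of thumb from this summary can be written down as a small helper. It is only a sketch of the slide's guidance, not a sizing or design tool. The function name, its parameters and the hard 60 km cut-off are simplifying assumptions, and a real decision still has to come from the BIA.

```python
METRO_DISTANCE_LIMIT_KM = 60  # approximate threshold quoted on the slides


def suggest_technologies(distance_km: float,
                         needs_nondisruptive_avoidance: bool,
                         needs_ordered_testable_recovery: bool) -> list[str]:
    """Map a few coarse requirements to vMSC, SRM, or both."""
    suggestions = []
    within_metro = distance_km < METRO_DISTANCE_LIMIT_KM
    if within_metro and needs_nondisruptive_avoidance:
        suggestions.append("vMSC")  # stretched cluster: HA and vMotion between zones
    if needs_ordered_testable_recovery or not within_metro:
        suggestions.append("SRM")   # orchestrated, testable recovery plans
    return suggestions


print(suggest_technologies(20, True, True))    # ['vMSC', 'SRM']
print(suggest_technologies(300, True, False))  # ['SRM']
```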


Questions and Answers

Twitter: @david_pasek

Blog: http://blog.igics.com


Backup slides


Metro cluster (vMSC) topologies

Multisite vSphere Metro Storage Cluster
Physical Infrastructure Logical Design

Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Uniform Mode – Active/Active storage

Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Non-Uniform Mode – Active/Active storage

Multisite vSphere Metro Storage Cluster
vMSC Logical Design – Uniform Mode – ALUA storage

Site Recovery

VMware SRM Terminology

• SRM - Site Recovery Manager
• Data Replication types
  – HBR – Host Based Replication (async replication with delta 15 min => RPO)
  – SBR – Storage Based Replication (sync/async replication, sync => I/O write performance impact)
• SRM Constructs
  – Protection Group = group of VMs to protect as a single business service
  – Recovery Plan = RunBook describing how the VMs in a Protection Group have to be started
• Failover and Failback process
  – Failover
  – Failover-test
  – Re-protect
  – Failback
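How the asynchronous replication delta translates into RPO can be shown with a short check. The sketch below assumes the 15-minute figure from the HBR bullet. The function names and timestamps are hypothetical, and nothing here queries vSphere Replication itself.

```python
from datetime import datetime, timedelta

HBR_RPO = timedelta(minutes=15)  # figure quoted on the slide for host-based replication


def data_loss_window(last_replica_at: datetime, failure_at: datetime) -> timedelta:
    """Changes written after the last completed replica are lost on failover."""
    return failure_at - last_replica_at


def rpo_met(last_replica_at: datetime, failure_at: datetime,
            rpo: timedelta = HBR_RPO) -> bool:
    """True when the data loss window does not exceed the RPO."""
    return data_loss_window(last_replica_at, failure_at) <= rpo


last_replica = datetime(2016, 12, 6, 10, 0)
print(rpo_met(last_replica, datetime(2016, 12, 6, 10, 12)))  # True
print(rpo_met(last_replica, datetime(2016, 12, 6, 10, 20)))  # False
```

With synchronous storage-based replication the window is effectively zero, which is paid for with the write latency impact noted above.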


SRM Logical Design


Diagram labels: two sites, DC1 (ANT) as Site A Datacenter and DC2 (BUD) as Site B Datacenter. Each site shows a vCenter Server, SRM with SRA and vSphere Replication, ESXi hosts (esx-01, esx-02, esx-X) with vRA, VMs Workload, and a SAN with Replicated LUNs and Non-Replicated LUNs (LUN01, LUN02, LUNX). Further labels: Authentication, SRM Plug-in, vSphere Client.
