WHITE PAPER Datrium ControlShift™ Mobility and DR … · 2020-04-25 · datrium’s cotrosift™ mobility and dr orchestratio white paper. white paper →

WHITEPAPER

Datrium ControlShift Mobility and DR Orchestration


© 2020 Datrium, Inc. All rights reserved Datrium | 385 Moffett Park Dr. Sunnyvale, CA 94089 | 844-478-8349 | www.datrium.com


Contents

1. Introduction
2. Legacy Data Protection Architectures
2.1 RPO/RTO and Compute Resource Challenges
2.2 Complexity and Inefficiency of Juggling Multiple Products
2.3 Stretched Clusters and Continuous Data Protection Alternatives
2.4 Data Integrity Risks
3. ControlShift Integrates All Backup and DR Components Into One System
3.1 Low RPO/RTO and Minimal Resource Requirements
3.2 Simplicity of a Single Data Stack
4. DR Orchestration as a Service
5. Eliminating a Secondary DR Site
5.1 The Same VMs On Premises and in the Cloud
5.2 Ahead-of-Time Deployment of a Cloud DR Site
5.3 Just-in-Time Deployment of a Cloud DR Site
6. Summary





1. Introduction

Datrium ControlShift™ is a cloud-based disaster recovery (DR) and workload orchestration service for on-premises and cloud environments, and it's a key component of Datrium Disaster Recovery as a Service (DRaaS) with VMware Cloud on AWS. ControlShift provides end-to-end orchestration for workload protection, backup, and replication to the cloud or other on-premises sites, DR plan definition, workflow execution, testing, compliance checks, and report generation.

Figure 1 – User experience with modern ControlShift SaaS app

ControlShift DR plans operate on three different types of sites: Protected, Backup, and Failover. Separate Backup and Failover site designations enable ControlShift to pass down the economic benefits of cloud pay-per-use elasticity to the user with just-in-time creation of a Cloud Failover Site – a Software-Defined Data Center (SDDC) in VMware Cloud on AWS. With ControlShift and DRaaS, VMware Cloud on AWS becomes a foundational piece of a complete cloud-native, on-demand DR solution.

Within ControlShift, the protected site can be any on-premises or cloud system executing VMware workloads covered by a DR plan. The backup site is a physical Datrium DVX or Cloud Backup instance that receives backups from the protected site. The failover site is a physical or cloud-based site, which is designated to take over workload execution following a disaster. The four basic ControlShift use cases are shown below:

Use Case              | Protected Site      | Backup Site                       | Failover Site
Prem → Prem → Prem    | On-Premises         | On-Premises                       | On-Premises
Prem → Cloud → Prem   | On-Premises         | Cloud Backup                      | On-Premises
Prem → Cloud → Cloud  | On-Premises         | Cloud Backup                      | VMware Cloud on AWS
Cloud → Cloud → Cloud | Availability Zone A | Cloud Backup, Availability Zone B | VMware Cloud on AWS, Availability Zone B





ControlShift enables administrators to organize sites into flexible topologies, so they can adapt these basic use cases to the availability needs of their unique environments. Here are some examples of these topologies:

Prem → Prem → Prem

In a simple 2-site topology, a single DVX System serves as both a backup and failover site. The protected site sends snapshot replicas to the backup/failover site. ControlShift orchestrates instant failover to the failover site with no additional data transfer at recovery time.

Prem → Prem → Prem, Prem → Cloud → Prem

In this common 3-site topology, the protected site sends backups to both the failover site and a cloud backup site. This topology is useful for extra protection, for longer archiving, or as a method of recovering from ransomware attacks that require deep history. When executing failover, ControlShift can either use backups available on the failover site (zero RTO) or fetch them from the backup site.

Prem → Cloud → Prem

This topology is based on a single cloud backup site. It can be used for operational recovery or DR when the on-premises hardware remains operational (e.g., user errors, ransomware attacks). ControlShift failover will use either the local backups on the protected site (zero RTO) or cloud backups.

Extending the previous topology enables recovery from cloud backups even when the disaster destroys the protected site. A new system is procured, followed by ControlShift failover using snapshots stored in the cloud. While the RTO is longer (the backups need to be retrieved by the new system), this is a very cost-efficient solution that makes recovery possible without maintaining a secondary on-premises DR site.

Prem → Cloud → Cloud

This topology combines the economic benefits of eliminating a secondary DR site with the instant-RTO recovery to the public cloud that DRaaS provides. Cloud Backup is a core component of DRaaS and serves as a cloud backup site. Following a disaster event, a failover site is deployed in the public cloud, and ControlShift performs a failover to the newly created cloud failover site. Most cloud infrastructure charges apply only while the cloud failover site is deployed. Only cloud backup charges apply during regular operation.





2. Legacy Data Protection Architectures

Legacy storage protection architectures rely on tiers of specialized primary and secondary storage appliances and accompanying backup software. In many scenarios, DR requirements are met by dedicated DR orchestration software that is separate from backup software. These architectures evolved during the client-server era and present RPO/RTO, resource utilization, and risk mitigation challenges for modern, hybrid cloud environments. To get a complete data protection solution, IT teams have to tie together many different products from multiple vendors with inherent operational complexity.

2.1 RPO/RTO and Compute Resource Challenges

Traditional backup software runs once a day and therefore delivers a 24-hour RPO and RTO. Because of its performance impact and high resource usage, the backup is most often run in a daily “backup window” during off-hours. While this might be sufficient for some backup scenarios, DR generally has more demanding RPO and RTO requirements. Application owners demand better SLAs for DR that can't be met by legacy backup software because of associated performance bottlenecks and the impact of backups on production workload execution.

Recovery from backups involves retrieving full backups from the backup array to the primary storage array, resulting in an inferior RTO that might even exceed the RPO. This data transfer may go on for many days following a disaster. A 24-hour backup RPO and RTO don't meet modern DR requirements, forcing administrators to deploy other dedicated DR solutions in parallel with backups. A common method for implementing DR with lower RPO and RTO is based on primary-array LUN or volume mirroring.
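To make the scale of such a recovery transfer concrete, the following minimal Python sketch estimates how long a full restore takes over a given link. The function name, the efficiency factor, and the 100 TB and link-speed figures are illustrative assumptions, not measurements from any product.

```python
def restore_transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate the hours needed to copy `data_tb` terabytes over a `link_gbps` link.

    `efficiency` is an assumed fraction (0, 1] of the nominal bandwidth actually achieved.
    """
    if data_tb < 0 or link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("data_tb must be >= 0, link_gbps > 0, and 0 < efficiency <= 1")
    total_bits = data_tb * 1e12 * 8                      # decimal terabytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600    # seconds -> hours


# Hypothetical example: a 100 TB site restored over 10 Gbps and 1 Gbps links.
for gbps in (10, 1):
    hours = restore_transfer_hours(100, gbps)
    print(f"{gbps} Gbps: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumed figures, the transfer alone takes roughly 22 hours at 10 Gbps and more than nine days at 1 Gbps, before any data transformation or application restart time is counted.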

Array-based LUN mirroring is more efficient for protecting entire sites than backup software replication, because it involves replicating data in its native storage format without rehydrating and transforming the data multiple times by the backup software. Because of its lower resource usage and a smaller impact on production workloads, array LUN mirroring could run on a more aggressive schedule than traditional backups (e.g., every 30 minutes).

While commonly used for DR, array LUN mirroring doesn't eliminate backups because LUN or volume replicas don’t provide full backup capabilities to satisfy regulatory and operational requirements for data protection:

• No deep backup storage: arrays can accommodate only a modest number of LUN snapshots

• No backup catalog

• No visibility inside a LUN

• No recovery of individual VMs or files

One of the main reasons that parallel backup and DR stacks exist is the lack of visibility inside the LUN. The types of data protection provided by the legacy backup and DR solutions are often complementary and make up for mutual deficiencies. For example, moving a mission-critical application between array LUNs might adversely affect LUN-based DR (an application lost upon the LUN recovery from a snapshot), but it's generally handled correctly by backup software policies that are attached to user-visible entities as opposed to storage array LUNs.

2.2 Complexity and Inefficiency of Juggling Multiple Products

Over time, DR orchestration software has evolved to coordinate recovery based on native array mirroring. DR orchestration products are complex distributed systems that integrate with native array mirroring via installable third-party array-specific agents.

The following diagram illustrates the number of data transfers required for a typical data protection architecture integrating best-of-breed backup and DR products. The data protection part alone involves five different data transfers with a majority requiring I/O intensive data transformations. Restoring from backups or a DR failover requires several additional data transformations as well as data transfers that aren't shown in this diagram.





Figure 2 – Number of data transfers required for a typical data protection architecture integrating best-of-breed backup and DR products

Backup software keeps data on a backup appliance (a specialized array known as a Purpose-Built Backup Appliance, per Gartner nomenclature). As part of the backup process, the backup software copies recent changes from the primary array to the backup array with the help of hypervisor changed block tracking APIs. Primary storage and backup arrays have different filesystems. Also, backup software typically uses its own client filesystem layered on top of the backup array, and it manages snapshots of protected entities.

Below is an example of a common backup and DR stack deployed on both primary and backup sites. These products come from different vendors and require four independent management consoles.

Component                 | Product
Primary Storage Array     | Dell EMC Unity 450F
Backup Array              | Data Domain DD6300
Backup Software           | Commvault
DR Orchestration Software | Zerto

2.3 Stretched Clusters and Continuous Data Protection Alternatives

Stretched Clusters and Continuous Data Protection (CDP) are the basis for an alternative DR mechanism. Stretched Clusters aim to provide zero RPO by synchronously replicating every write from the primary to the secondary site. They impose strict requirements on inter-site network latency, in the 1-5 ms range, and because that bound limits the distance between sites, they can't protect against regional disasters.

Because each write is replicated over the network in its entirety, Stretched Clusters have high network-bandwidth requirements. Similar to LUN replication, Stretched Clusters require DR orchestration software to coordinate recovery on the secondary site. Similar to array LUN mirroring, Stretched Clusters don't eliminate backups for operational recovery. The resulting architecture remains similar to that described in the previous section.

CDP addresses the rigidity of Stretched Cluster network requirements by relaxing replication from synchronous to semi-synchronous. CDP solutions gained some popularity by providing high levels of data protection for a few carefully chosen workloads. However, they're seldom used as a complete DR solution for the entire enterprise site. CDP products are available as third-party software, and they don't eliminate requirements for backup storage appliances.





2.4 Data Integrity Risks

Multiple transfers of data with extensive data transformations between different complex products from multiple vendors have inherent data integrity risks. How can the administrator be sure that the backup created by reading the blocks changed between two vSphere VM snapshots stored on a Dell EMC storage array, copied into a Commvault backup stored on a Data Domain appliance, and subsequently replicated over a WAN to a DR site, actually represents the original point-in-time application state? Similarly, DR orchestration software that relies on third-party storage and replication has little chance of detecting in-transit data corruption due to misconfiguration or a software or hardware fault. There are no global end-to-end data integrity checks or APIs that apply across all the multiple hardware and software products from different vendors.

In the end, administrators are left with a complex web of solutions integrating components from three or more vendors leading to increased complexity, ample opportunity for misconfiguration, and staggering levels of resource inefficiencies due to multiple data transformations with no end-to-end integrity checks.

3. ControlShift Integrates All Backup and DR Components Into One System

Use cases: Prem → Prem → Prem, Prem → Cloud → Prem, Prem → Cloud → Cloud, Cloud → Cloud → Cloud

ControlShift integrates all aspects of backup and DR in one centrally managed system. This solution has all the benefits of best-of-breed backup and DR products without the associated complexities and inefficiencies of navigating a web of management consoles and excessive resource usage due to multiple data transfers with expensive data transformations.

Figure 3 – ControlShift integrates all aspects of backup and DR in one centrally managed system





3.1 Low RPO/RTO and Minimal Resource Requirements

ControlShift leverages backups based on native, storage-level snapshots with an RPO of minutes, not hours or days. DVX unifies primary and secondary storage environments, and it natively supports forever-incremental replication with no data transformations. That replication can be done from DVX to DVX, DVX to Cloud Backup, or any VMware workload to Cloud Backup – the latter two use cases can be done as part of a Datrium DRaaS deployment. Backups for any VMware workload can be captured using DRaaS Connect, a key feature of DRaaS.

This snapshot approach enables very aggressive backup and replication schedules with low resource usage and has a minimal impact on the executing workloads. A DR failover requires no additional data transfers – VMs are restarted directly from backups for any available restore point. Because no data transfer is required for recovery, and protected workloads are restarted directly from backups on the DR site, the resulting RTO is instant, similar to RTO of array LUN mirroring used with third-party DR orchestration products.

However, unlike array LUN mirroring, the DVX filesystem provides a full-featured backup solution – backups are accessed with a searchable catalog and kept on cost-effective SATA HDDs with modern data reduction technologies applied at all times. Also, primary copies and local backups share the same storage pool, drastically cutting physical storage requirements for data protection. Cloud Backup, based on AWS S3, provides similar capabilities and cost efficiencies for situations that require cloud-based backup or DRaaS.

3.2 Simplicity of a Single Data Stack

ControlShift eliminates the need for parallel hardware and software backup and DR stacks by integrating all components and aspects of backup and DR in a single system with unified management. A protected DVX and one or more accompanying VMware workloads deployed at another location or in the cloud are managed by a unified cloud orchestration service.

DVX integrates primary and secondary storage, making it possible to use a single management console to establish backup and replication policies and to configure, test, and execute DR plans. Both backup policies and DR plans operate on exactly the same abstractions: backups for VMs and groups of VMs. Because snapshots are taken at the storage level, ControlShift delivers consistent point-in-time backups across many VMs executing on different servers. This advanced functionality requires native storage integration and isn't available from third-party backup software that relies on hypervisor APIs to take snapshots and copy snapshot state into backups.

Figure 4 – Easy UI to select a protection group snapshot





The system’s built-in health checks can pinpoint problems anywhere in the backup and DR stack. For example, replication failures due to network connectivity losses will automatically flag all affected DR plans. ControlShift also automatically performs DR plan compliance checks to ensure the changes in the execution environment don't invalidate DR plans.

3.3 End-to-End Data Integrity Checks

A single-data-stack backup and DR solution eliminates the risks associated with multiple data transformations and misconfigurations. Because ControlShift controls the protected, backup, and recovery site endpoints and orchestrates all movements of data, it also automatically performs end-to-end integrity checks to verify backup fidelity regardless of data location or past replication history. ControlShift uses Automatrix enabling technology, including an efficient algorithm that calculates cryptographic hashes of backups and primary storage, to continuously validate data integrity across the entire distributed environment, both on premises and in the cloud.
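The general idea behind hash-based verification can be sketched in a few lines of Python. This is a minimal illustration using SHA-256 from the standard library, not the Automatrix algorithm, whose details are not described in this paper. The function names and the representation of a snapshot as a sequence of byte blocks are assumptions made for the example.

```python
import hashlib
from typing import Iterable


def snapshot_fingerprint(blocks: Iterable[bytes]) -> str:
    """Return a SHA-256 hex digest over the snapshot content, read block by block."""
    digest = hashlib.sha256()
    for block in blocks:
        digest.update(block)  # streaming update avoids holding the whole snapshot in memory
    return digest.hexdigest()


def replica_matches_source(source_blocks: Iterable[bytes], replica_blocks: Iterable[bytes]) -> bool:
    """Compare fingerprints computed independently at the source and at the replica."""
    return snapshot_fingerprint(source_blocks) == snapshot_fingerprint(replica_blocks)


# Hypothetical data standing in for a snapshot and two replicas of it.
source = [b"boot-sector", b"app-data", b"log-tail"]
good_replica = [b"boot-sector", b"app-data", b"log-tail"]
bad_replica = [b"boot-sector", b"app-dat\x00", b"log-tail"]  # one corrupted byte

print(replica_matches_source(source, good_replica))
print(replica_matches_source(source, bad_replica))
```

Because each site computes its own fingerprint, only the short digests need to cross the network to detect a corrupted or incomplete copy. A production system would also need to handle incremental snapshots and scale, which this sketch does not attempt.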

4. DR Orchestration as a Service

Use cases: Prem → Prem → Prem, Prem → Cloud → Prem, Cloud → Cloud → Cloud

DR orchestration software products are complex, distributed systems composed of dedicated DR orchestration servers and internal databases often augmented with third-party array software agents. These servers and databases are provisioned per site and need to be licensed, secured, monitored, managed, and upgraded, which requires additional maintenance and extra operations skills. The initial installation and configuration of DR products often require professional services engagements making the overall solution more expensive. DR rollout and upgrade processes are longer because of interaction intricacies when dealing with multiple cross-vendor products and components.





ControlShift is delivered as a service: there is nothing to install and nothing to manage. The ControlShift orchestration engine runs as an AWS-based service and leverages the public cloud infrastructure to achieve high availability for its internal operation. DR plans and execution states are replicated across multiple availability zones (AZs) with automatic failover to a healthy AZ, without any data loss if there's a disaster that causes a public cloud outage. Monitoring and upgrades are automated and performed by Datrium as part of the service offering.

The ControlShift service is activated online, making it immediately operational and allowing users to focus on designing and testing their DR plans instead of managing the internal complexities of the DR orchestration software. ControlShift includes all necessary network connectivity and encryption software, and it establishes a secure bidirectional channel between protected sites and the orchestration engine. No external VPN is required.

Figure 5 – ControlShift orchestrates failproof, cost-effective DR across DVXs and from the cloud

This diagram shows the main ControlShift service components. ControlShift employs serverless Lambda functions to automate the initial deployment of other service components and subsequently monitor and heal all deployed services. In the extreme case of the entire AZ going down, the Lambda functions will redeploy all ControlShift components in another AZ restoring service availability.

DynamoDB keeps a cloud service registry used by the Lambda functions. Cloud Backup also uses it for auxiliary metadata indexing. All Datrium services are deployed as Amazon Machine Images (AMIs) into a Datrium-created VPC and Subnet. VPC endpoints used to access all other external services required by ControlShift and Cloud DVX are created automatically, including the endpoints for DynamoDB, S3, and Internet Gateway used by the Datrium VPN.

ControlShift uses the AWS Aurora RDS service for its internal transactions, such as saving plans and committing plan execution states. Aurora is highly available, with data replicated six ways across several AZs.

Cloud Backup uses S3 as a repository of backups in a Datrium-native, forever-incremental compressed and deduplicated form. Cloud Backup instances run a copy of the Datrium filesystem designed for efficient handling of backups on cost-optimized spinning disk media, such as S3.





5. Eliminating a Secondary DR Site

Use cases: Prem → Cloud → Prem


Replacing an on-premises DR site with a cloud-based DR site has significant CAPEX and OPEX advantages. However, the practicality of existing solutions is severely limited by the lack of hypervisor interoperability between private and public clouds and the associated costs of the public cloud infrastructure.

5.1 The Same VMs On Premises and in the Cloud

While the VMware ESX hypervisor dominates on-premises private cloud deployments, public clouds use several other incompatible hypervisors: AWS relies on Xen and more recently on KVM, similar to Google Cloud; and Azure relies on the Microsoft proprietary hypervisor. The translation between VM formats is a brittle and time-consuming process that goes beyond VM disk format conversion. Complex vSphere enterprise environments rely on many other virtualization abstractions that have no immediate analogs in the public cloud: clusters, resource pools, datastores, virtual switches, port groups, etc. vSphere also offers a set of widely used services based on these abstractions that have no equivalent in the public cloud: vSphere HA, FT, vMotion, DRS, etc.

VMware Cloud on AWS finally makes the transition between private and public clouds robust by presenting an execution environment in AWS that is similar to the on-premises execution environment. No VM conversion is required, VMs retain their native vSphere format, and users get access to the familiar abstractions and management tools following a failover to the cloud – the same tools they used on premises before the failover.

Figure 6 – Creating a DR plan, users follow a process that is identical to Prem to Prem DR





As part of DR plan creation, users map their on-premises virtual infrastructure abstractions (networks, resource pools, folders, datastores, IP addresses, etc.) to the corresponding entities in VMware Cloud, following a process that is identical to that of Prem → Prem DR. The native on-premises VM geometry is fully preserved, as are all virtual hardware devices. The existing in-guest OS drivers continue to function the same way following a migration to the cloud, eliminating all risks of VM conversion between different hypervisor types and the associated virtual hardware and guest OS driver changes.
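One way to picture such a mapping is as a lookup table per kind of abstraction, together with a check that every entity the protected VMs depend on has a target. The Python sketch below is purely illustrative: the structure, the function name, and all entity names are hypothetical and do not reflect the actual ControlShift plan format.

```python
from typing import Dict, List

# kind of abstraction -> {on-premises name: cloud-side name}; all names are hypothetical
SiteMapping = Dict[str, Dict[str, str]]


def find_unmapped(used: Dict[str, List[str]], mapping: SiteMapping) -> Dict[str, List[str]]:
    """Return the on-premises entities used by protected VMs that have no cloud target."""
    missing: Dict[str, List[str]] = {}
    for kind, names in used.items():
        known = mapping.get(kind, {})
        gaps = [name for name in names if name not in known]
        if gaps:
            missing[kind] = gaps
    return missing


example_mapping: SiteMapping = {
    "networks": {"prod-net": "sddc-prod-net"},
    "folders": {"Finance": "DR/Finance"},
    "datastores": {"dvx-datastore": "WorkloadDatastore"},
}
entities_in_use = {
    "networks": ["prod-net", "dmz-net"],  # dmz-net has no cloud target yet
    "folders": ["Finance"],
    "datastores": ["dvx-datastore"],
}

print(find_unmapped(entities_in_use, example_mapping))
```

A check of this shape is the kind of validation a plan compliance test could run before a failover, so that a newly added network or folder without a target is caught ahead of a disaster rather than during one.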

5.2 Ahead-of-Time Deployment of a Cloud DR Site

This diagram shows an example of a deployed Cloud DR site that maps to an SDDC in VMware Cloud on AWS. In cases where a DR site has a secondary function of executing non-DR workloads during regular operation, an SDDC can be provisioned before failover.

Figure 7 – Example of a deployed Cloud DR site that maps to an SDDC in VMware Cloud on AWS

If the sole purpose of the Cloud DR site is to take over workload execution in the event of a disaster, and it remains otherwise unutilized, further significant cost savings are possible with the just-in-time deployment.

5.3 Just-in-Time Deployment of a Cloud DR Site

While replacing an on-premises DR site with a virtual site hosted in the public cloud is attractive for many reasons, by itself, it doesn't necessarily reduce the total cost of the overall DR solution because of the recurring charges for maintaining a cloud DR site. The DR costs are merely shifted from on-premises capital and operational expenses to the recurring costs of maintaining an always-on cloud DR site. A detailed total cost of ownership (TCO) analysis is needed to ensure that the overall cloud DR solution is price competitive with the original on-premises DR solution.

DR-related activities don’t contribute to the company’s top-line performance, but they are necessary to mitigate risk. Optimizing the costs of DR is, therefore, an important TCO consideration. Just-in-time deployment of a cloud DR site presents an attractive alternative to continuously maintaining a warm standby cloud DR site. With just-in-time deployment, the recurring costs of a cloud DR site are eliminated in their entirety until a failover occurs and cloud resources are provisioned.





Figure 8 – With just-in-time deployment, cloud DR site recurring costs are eliminated until a failover occurs, and cloud resources are provisioned.

Dedicated on-premises DR sites are typically minimally utilized, resulting in resources being wasted: real estate, power, cooling, compute resources, skilled labor to keep DR sites operational. The on-demand nature of public clouds enables ControlShift to drastically reduce the DR operating costs by deploying the bulk of the DR infrastructure programmatically following a DR event. During steady-state operation, ControlShift maintains a minimal, low-cost AWS cloud footprint to accommodate cloud backups with no ongoing charges for the cloud DR site. The backups are sent to the cloud backup site, and after some processing, land in a cost-effective compressed and deduplicated form in an S3 bucket. In the just-in-time mode of deployment, a cloud DR site is created only after a disaster or ransomware attack. VMware Cloud SDDC, a Cloud DR site with a significantly larger server footprint and associated costs, is deployed only immediately before executing a DR plan.

To make this possible, ControlShift leverages the space and cost efficiencies of Cloud Backup. The protected site replicates VMs or protection groups (PGs) in their forever-incremental format to Cloud Backup, which in turn stores them in a compressed and deduplicated native format within the low-cost S3. During regular operation, the costs of data protection are limited to the costs of the Cloud Backup service and S3 media.

Following a DR event, ControlShift deploys a new SDDC in VMware Cloud on AWS and orchestrates the failover to this SDDC as part of a DR plan execution. This process uses a fast high-bandwidth network link from VMware Cloud to AWS S3 to get access to backups. The recurring charges for the Cloud DR site start accumulating only following the SDDC deployment. The just-in-time deployment of SDDC reduces DR TCO by more than an order of magnitude.
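The cost structure described above can be expressed as a small model: a backup charge that always applies, plus an SDDC charge that applies only for the hours the SDDC exists. The Python sketch below is a simplified illustration. The function name and every rate and duration in it are hypothetical placeholders, not Datrium, VMware, or AWS pricing, and the resulting ratio depends entirely on the values supplied.

```python
HOURS_PER_YEAR = 24 * 365


def annual_dr_cost(backup_monthly: float, sddc_hourly: float, sddc_hours: float) -> float:
    """Yearly cost = always-on cloud backup charges + SDDC charges for the hours deployed."""
    if min(backup_monthly, sddc_hourly, sddc_hours) < 0:
        raise ValueError("rates and hours must be non-negative")
    return 12 * backup_monthly + sddc_hourly * sddc_hours


# Hypothetical inputs, chosen only to show how the model is used.
backup_monthly = 1_000.0   # assumed monthly cloud backup and storage charge
sddc_hourly = 20.0         # assumed hourly charge for a deployed SDDC
jit_hours = 72             # assumed hours per year the SDDC runs for tests or a failover

always_on = annual_dr_cost(backup_monthly, sddc_hourly, HOURS_PER_YEAR)
just_in_time = annual_dr_cost(backup_monthly, sddc_hourly, jit_hours)

print(f"always-on:    {always_on:,.0f} per year")
print(f"just-in-time: {just_in_time:,.0f} per year")
print(f"ratio:        {always_on / just_in_time:.1f}x")
```

The model makes the dependency explicit: the saving grows as the SDDC charge dominates the backup charge and as the number of deployed hours per year shrinks. A real TCO analysis would substitute actual contract rates and include data transfer and failback costs, which are omitted here.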





Figure 9 – Just-in-time deployment of a Cloud DR site

ControlShift supports an efficient, orchestrated failback following an on-premises site recovery. If, upon recovery, the on-premises site retains some pre-disaster data, only the data changes made while executing in the cloud DR site are transferred back to the on-premises protected site.

Ahead-of-time vs. just-in-time provisioning of SDDC is a trade-off between costs and RTO. With ahead-of-time SDDC provisioning, SDDC creation latency could be eliminated. Just-in-time SDDC provisioning drastically lowers the costs but increases the RTO by deploying SDDC only following a failover.

6. Summary

Datrium ControlShift is a cloud-based DR and workload orchestration service that leverages the execution and operational efficiencies of a single integrated data stack to orchestrate all aspects of DR. ControlShift with DRaaS is dramatically easier to use and significantly less resource-intensive than legacy DR solutions resulting in lower RPO and RTO for cloud and on-premises environments. The integrated data and orchestration stack enables consistency-checking of the entire environment, which drastically reduces errors at the time of a disaster. Just-in-time DR to the cloud provides further transformational economics.

