Upload
mike-alvarado
View
77
Download
0
Embed Size (px)
Citation preview
Today’s Itinerary
• A Blue Print Primer
• A Blue Print Based Disaster Recovery Solution
• Our Demo Configuration
• Failure, Data Loss: Fail Over
• Repair, Planned Fail-Back
• Analysis of the Solution and the Blue Print
It’s The Data, Stupid!!
• 60% of companies that lose their data go out of business within 6 months!
• 93% of companies that lose the data center for 10 days or more file for bankruptcy within one year! (NA&RA)
• Conclusion: The two critical components of DR are: – Protecting the data, and
– To be able to resume quickly after disasters and losses!
Cisco and Blueprints
• 2002: Cisco campus on Tasman Drive:
– Every building was identical!
• Identical blue print for all infrastructure:
– Water, Sewer, Power, Networking, Fire safety, ..
– Unifies all maintenance, management, upgrades, ..
• The entire campus was built much faster and more efficiently “by cookie cutter”!
– Populating the campus was done in a hurry. Everything looked the same.
– Buildings were reusable as needs changed.
Our Blue Print Objective
• Achieve the same level of efficiency with IT solutions as Cisco did with their campus blue print!
• Reuse of architecture and modules to assure rapid an efficient scalability and unified management framework.
• Enable rapid creation of new solutions within the same architectural framework.
• Enable easier capacity scaling without affecting IT management.
DR/DP Solution Blue Print TOC:
• The Platform: – The Physical: Sites, hardware, network
– Virtualization: Server, storage, network
• The Applications to protect
• The Solution Engine: Backup software
• Use cases that any DR solution must support
• Normalized work flows implementing the use cases.
• Questionnaire to assist adaption and customization
Blue Print Template: The Architectural Principles
• Extensive use of virtualization in servers, storage and networking.
• Integrating layers of the most cost effective products.
• Creating a “building blocks” approach to solutions design and implementation.
• Creating designs that can easily be adapted, extended and customized to meet specific requirements.
• Unified architecture to simplify management of assets and decision points.
• The results are sophisticated and complex, but integrated and efficient business solutions
Blue Print Summary
• Virtualization + sound architectural principles enables a more effective approach to building and implementing solutions.
• Customers gets more sophisticated and more reliable solutions per $$.
• Resellers and integrators can reuse the building blocks and handle more customers in a shorter amount of time
• Reuse of building blocks implies simplified and uniform management.
Our DR Solution
• We have a business critical application.
• We want to protect it with a DR solution so that in the case of loss of disk, server, site etc., we can rapidly fail over to the DR instance.
• Our platform consists of VMware, NetBackup and Windows on top of NEC’s servers, storage and network.
Objectives and Requirements
• RPO: How much data/time are we willing to lose? – I.e. scheduled updates every RPO/2.
• RTO: How soon must we operational? – This is total time recovery time. It includes:
• The time to make the fail-over decision; • The time required to failover and start DR instance once the
decision has been made.
• DR or DP? – DR: DR site can support all critical applications. – DP: Only the data is protected. Repair of and recovery
is to the production site? – Short RTO => DR standby capability!
RPO and RTO
• RPO: Predominantly a WAN bandwidth issue. – WAN bandwidth is expensive!! – More frequent snapshots -> higher data change volume/hr. – Note that snapshots are inexpensive, they do not affect
application performance.
• RTO: Total Fail-Over Time: – When the decision is made, how fast to operational
applications at DR site? – Cheat!! By updating your DR instances after every
completed update!
• How reliable is it! – Test often!! Or even better, perform planned fail-over/back
regularly!!
NIC HBA
ESX Hypervisor
NEC FT-Gemini
Guest
App
Guest Guest
App
App
Guest
VM1 VM2 VM3 VMn
LUN
App …
Array:
FC/iSCSI SAN
LUN
SAN: NEC D3/D4 array with FC or iSCSI connect:
NEC 5800 FT Server running VMware VMs with applications:
Array:
HBA
TCP/IP Network
NIC
App
NEC Hydrastor
VM
VM
VM
LUN
LUN
Standby Servers and storage
DR/DP Site:
Backup from Production site
Preparatory restore for DR
ESX Hypervisor
…
NB Client
Required Use Cases
• Unplanned and planned fail-over to the DR instance. – Planned: for maintenance or for testing.
• Planned and unplanned (!!) fail-back. – Unplanned when the DR failover is aborted.
• Recovery of lost files or folders back to the production instance. – This may include the entire disk/file system.
– The backup software (NetBackup )provides this capability all by itself.
In Your World
• What is the typical range for RPO and RTO in the DR solutions you build?
• How do they vary across types or sizes of companies?
• Other requirements you see that challenge the budget, your efficiency as designers or implementers, etc?
• What is the % split between full DR and DP only solutions? Is it changing?
Our Demo Solution
• A VM with a file system holding user data.
• For demo purposes our objectives are tight.
– RPO: 5 minutes:
• I.e. scheduled updates every 2.5 minutes.
– RTO: 5 minutes:
• Total time recovery time. Includes the time to make the fail-over decision and the time required to start DR instance once decision to fail over has been made.
Ready, Set, Fail
• Kill the disk or VM
• Verify service is dead
• Show solution or site diagram with big red mark on failed component
NetBackup Backup Host and Media Server
VM
VM
VM
LUN
LUN
Standby servers, storage
Production Site:
DR/DP Site:
VM1 VM2
LUN FS LUN
LUN
VM n
LUN NB
client
Backup to Archive Preparatory restore for DR
NB Agent
ESX Hypervisor
ESX Hypervisor VM
i … …
NB Media
Svr
FS
FS client
FS
Fail-Over: Should/shouldn’t
• If you fail over you WILL lose some data.
• If you can resume in less than RPO time, net win!
• If you can wait it out for some time, a bit beyond the RTO, low risk and convenience.
• Otherwise we’ll push the button and start the fail-over work flow
NetBackup Backup Host and Media Server
VM
VM
VM
LUN
LUN
Standby servers, storage
Production Site:
DR/DP Site:
VM1 VM2
LUN FS LUN
LUN
VM n
LUN NB
client
Backup to Archive
Prep. restore for DR
NB Agent
ESX Hypervisor
ESX Hypervisor VM
i … …
NB Media
Svr
FS
FS client
FS
Failing Over, UC-DR-2:
Work Flow for UC-DR-2:
1. Shut down any remaining production side VMs that are part of the application. – This prevents data corruption and network configuration
errors.
2. In parallel with step 3, begin performing any required network reconfiguration required for all clients to reach the DR solution instance. Save the last step that enables client access.
3. Start the DR instances of the component applications in the prescribed order to bring the business application on-line. Verify data/operational integrity.
4. Perform the last network configuration steps to allow clients to connect to the DR instance of the application.
Work Flow Verification
• Verify with PC that we have completed the fail-over.
– Connect to DR instance with PC
• Verify that we did lose some data, changes that occurred after the last complete update.
The Fail-Back Decision
• Have we repaired the production instance?
• Have we tested the new production instance?
• What is the optimal time for a planned fail-back?
– Does the organization have a time period that is more convenient for application shutdown?
NetBackup Backup Host and Media Server
VM
VM
VM
LUN
LUN
Standby servers, storage
Production Site:
DR/DP Site:
VM1 VM2
LUN FS LUN
LUN
VM n
LUN NB
client
Backup to Archive Preparatory restore for DR
NB Agent
ESX Hypervisor
ESX Hypervisor VM
i … …
NB Media
Svr
FS
FS client
FS
Planned Fail-Back: UC-DR-5
1. Shut down the DR instance of the application, running at the DR site.
2. Start reconfiguring the network an all other infrastructure components for steering clients to the production site instance of the application. Note that for steps 3 and 4, which are described below, they can be carried out in parallel with this step.
3. Perform the last NetBackup backup cycle for the LUNs in the DR site to the backup archive.
4. Perform a restore from the backup archive to the LUNs in the production site.
5. When steps 2 - 4 have all completed, start up the restored production instance of the application. Clients will now have full service.
6. Restart NetBackup protection of the production isntance to the backup archive at the DR site.
Verify restored service ..
• Verify that the new production instance is accessible:
– Connect to the repaired VM/FS/LUN and verify that service from the production instance have been restored.
• Verify that all data set changes made at the DR instance has been transferred to the production instance.
Discussion Topics
• How can we shorten the application downtime? – Iterative recovery: (Level 0 +) Incremental
“backups” to the repaired instance shortens the last update.
– Large data set, limited WAN bandwidth: Out-of-band restore.
• Testing the new instance: – Start it up and access data using the next to last
incremental above.
Blue Prints: Why A Better Result
• Using tested architectural principles and building blocks that are common to many solutions.
• The solution is based on a well proven architecture.
• The building blocks have been “road tested” with other solutions and customers.
• The same BP enables easier integration across new applications: – Easy to protect new applications by simply expanding the
existing DR solution
• The resulting solution is easily integrated into the management framework.
Blue Prints: Efficiency!
• Starting out with an 80% complete design, and a well understood set of building blocks.
• The same BP enables easier integration across new applications: – Easy to protect new applications by simply expanding the
existing DR solution
• Adaptation of the blue print to the customers’ specific environments and requirements are guided by the questions in the questionnaire – These variations do not break with the blue print, the principles
are still intact
• Scalability is built in at the application, solution and infrastructure level.
Meeting The Requirements?
• Does the solution meet our requirements?
• How many (%) of the DR solutions you have built or managed have tighter requirements?
– Or looser requirements?
– Should any of that be reflected in the blue print?
Improving the Solution
• What questions do we ask, what data do we collect, what experience do we need to take this solution to the next iteration?
Improvements, Savings
• How can we improve the solution for the customer?
– Can critical applications (customer db, ..) run on a different schedule than the common set (email, home dir)?
– How much to pay for the next amount of WAN bw?
• How can we save $$s for the customer?
– Reducing the DR foot print?