MSI SYSTEMS INTEGRATORS
Building Better Resiliency Through Data Replication
A presentation by:
Alan Salesman, IT Consultant
August 24, 2010
Abstract
• Are you feeling the pressure for high availability and need quick recovery without losing data? MSI will review approaches to improving your Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). A discussion of technologies, service definition, and process will help you understand how your critical systems can achieve performance and resiliency.
Industry Terms and Definitions
The Disaster
• The catastrophic failure of an IT service environment caused by an unplanned event.
That causes…
• A significant and extended failure of an IT service, resulting in a severe impact to the business through:
– A failure to satisfy a service level agreement (SLA)
– Loss of market share
– Revenue loss
– Exposure to litigation / legal action
Definitions
Recovery Time Objective (RTO)
– The duration of time, and the service level, within which an IT service must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in IT service continuity.
Recovery Point Objective (RPO)
– The point in time to which you must recover an IT service or data, as defined by your organization. This is generally what an organization determines is an “acceptable loss” in a disaster situation.
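As a rough illustration of how the two objectives are measured against an actual event, the sketch below derives the data-loss window and the downtime from three timestamps and compares them with stated objectives. The function names, the sample timestamps, and the 4-hour/24-hour objectives are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_recoverable_copy, outage_start, service_restored):
    """Return (data_loss_window, downtime) for a single outage.

    data_loss_window is compared against the RPO, downtime against the RTO.
    """
    data_loss_window = outage_start - last_recoverable_copy
    downtime = service_restored - outage_start
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True when both the recovery point and recovery time objectives are met."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical event: last good copy at 03:00, outage at 09:00, service back at 21:00.
loss, down = recovery_metrics(
    datetime(2010, 8, 24, 3, 0),
    datetime(2010, 8, 24, 9, 0),
    datetime(2010, 8, 24, 21, 0),
)
ok = meets_objectives(loss, down, rpo=timedelta(hours=4), rto=timedelta(hours=24))
print(loss, down, ok)
```

In this made-up event the service is back within the assumed RTO, but six hours of data are lost against an assumed four-hour RPO, so the objectives are not met.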
Definitions
Replication
– The process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault tolerance, or accessibility.
Resilient
– An IT service deployment philosophy that describes an IT service that is able to survive the loss of major components of the IT infrastructure without presenting a total loss of functionality to the user. (The application may continue to operate at an altered level of service.)
IT Service Continuity Management Spectrum
[Figure: frequency per year (axis marks at 100, 1, and 1/100,000) plotted against cost per occurrence ($ to $$$$$). Events shown: virus, data corruption, worms, application outage, disk failure, component failure, network problem, power failure, building fire, natural disaster, terrorism/civil unrest. Regions labeled: Availability Related, Disaster Resiliency, ITSCM – Recovery Related.]
Evolution of Disaster Resiliency
2010
• IT has grown to be a strategic center of companies, not just a cost center
• Globalization and collaboration open the IT environment
• Proactive disaster resiliency/data protection
1990
• IT is a competitive advantage
• Proprietary systems
• Reactive disaster recovery
Pressure from Both Sides
Business Pressure
• Globalization
• True 24/7 availability
• “Mobilization” of data – user access at any time, from any place
• All data critical
• Compliance/Regulatory
IT Pressure
• Increased data capacity
• Data growth at 30% per year
• Increased data value
• Maintenance windows few and far between
• Outages are visible; some make the news
Time and Expectations
Replication Model Key Characteristics
• Data replication is performed for critical data
• Consistency of replicated data is supported by synchronous or asynchronous techniques
• Recovery point is based on timing of consistency group creation and tape backup timing
• Recovery environment provides servers and supporting infrastructure for key applications
• Servers are inactive, except to support non-disk-based replication
• Tests can be performed and should be conducted on a regularly scheduled basis
• Facilities are provided by a vendor hot site/co-location agreement, a company-owned internal private cloud, or an external private cloud
• Configurations for infrastructure represent a subset of what is needed to support production levels, plus what is needed to support the replication method
• Network bandwidth is established to support the volume of data replication
Infrastructure Replication
Performed at the infrastructure layer (synchronous or asynchronous)
• Covers both geo-proximate and geo-remote dual data center designs.
• Application neutral and lightweight on the operating system.
• Asynchronous delivery can require up to three times the amount of physical storage capacity.
• Requires high bandwidth.
Example services are found in:
• Peer-to-Peer Remote Copy (PPRC) (IBM)
• Fast Remote Mirroring
• FlashCopy
Recommended Infrastructure/Policy:
• Geo-remote data centers with high-bandwidth connectivity
• Redundant/duplicate infrastructure
• A good job scheduler for asynchronous replication
• Defined policy/process for recovery or restarting of services
• Defined process for declaration of disaster
RTO/RPO Target
• RTO target can be 24 hours or less
• RPO target can be less than 4 hours
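The difference between synchronous and asynchronous replication can be sketched with a toy model. This is not how PPRC or any other product is implemented; the class and method names are hypothetical, and the model only shows why asynchronous delivery leaves some acknowledged writes exposed until they are shipped to the remote site.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a volume mirrored to a remote site (illustrative only)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.local = []          # blocks written at the production site
        self.remote = []         # blocks safely stored at the recovery site
        self.pending = deque()   # blocks acknowledged locally but not yet shipped

    def write(self, block):
        self.local.append(block)
        if self.synchronous:
            # Synchronous: the write completes only once the remote copy has it.
            self.remote.append(block)
        else:
            # Asynchronous: the write completes immediately and is shipped later.
            self.pending.append(block)

    def ship(self, max_blocks):
        """Send up to max_blocks queued blocks to the remote site, in order."""
        for _ in range(min(max_blocks, len(self.pending))):
            self.remote.append(self.pending.popleft())

    def blocks_at_risk(self):
        """Blocks that would be lost if the production site failed right now."""
        return len(self.local) - len(self.remote)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["a", "b", "c"]:
    sync_vol.write(block)
    async_vol.write(block)
async_vol.ship(max_blocks=2)
print(sync_vol.blocks_at_risk(), async_vol.blocks_at_risk())
```

The synchronous volume never has blocks at risk, at the cost of waiting on the remote site for every write; the asynchronous volume trades a small exposure window for lower write latency over distance.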
Client Focus
• Support a sophisticated recovery initiative by implementing storage replication technology to reduce:
– Recovery time (RTO)
– Recovery point (RPO)
MSI Role
• Performed a review of IT services included within scope to determine the impact of the new storage recovery environment on recovery exercises, outages and daily processing.
• Developed operational recovery procedure recommendations to leverage the capabilities of the new recovery environment and improve upon RTO/RPO objectives.
• Developed recovery test timeline recommendations to improve the quality and reduce the duration of recovery test exercises.
Beginning State
• All mainframe IT services had the same recovery time and recovery point objectives.
• The recovery time objective was 72 hours, which could not be achieved.
• Recovery exercises were performed during 64-hour sessions at a vendor-provided recovery site.
• The test process could be completed within 64 hours, but the process was a subset of what would be necessary for a full recovery.
Beginning State
• The recovery point objective was stated as being no greater than 24 hours from time of failure.
• Disaster recovery tapes were not taken off site until 3 pm each day.
• The time between tape creation and off-site rotation introduced an additional eight hours to the recovery point, or a total of 32 hours.
• On weekends and holidays, cycles were generally not scheduled, therefore backups and off-site rotation did not occur, leaving longer time periods of exposure.
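The recovery point arithmetic above can be written out as a small calculation. The 24-hour cycle and the eight-hour off-site delay come from the slide; treating each skipped weekend or holiday cycle as one additional full cycle of exposure is an assumption made here for illustration, and the function name is hypothetical.

```python
def worst_case_recovery_point_hours(cycle_hours, offsite_delay_hours, skipped_cycles=0):
    """Worst-case age of the newest off-site backup at the moment of failure.

    cycle_hours          -- time between backup cycles
    offsite_delay_hours  -- time a finished tape waits before leaving the site
    skipped_cycles       -- consecutive cycles not run (assumed model for
                            weekends and holidays)
    """
    return cycle_hours * (1 + skipped_cycles) + offsite_delay_hours


weekday = worst_case_recovery_point_hours(24, 8)
weekend = worst_case_recovery_point_hours(24, 8, skipped_cycles=2)
print(weekday, weekend)
```

The weekday figure reproduces the 32 hours stated above; the weekend figure is only as good as the assumption that two consecutive cycles were skipped.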
Technology
• (2) DS8300 Storage Subsystems
– FlashCopy
– Global Mirror
  • PPRC
  • Asynchronous replication
– Storage instances
  • A Copy – Production Local DS8300
  • B Copy – Performance Remote DS8300
  • C Copy – Consistency Remote DS8300
  • D Copy – DR Test Remote DS8300
Strategic Shift to Internal Recovery
• Driven by inflexible recovery vendor:
– Pricing of services
– Definition of roles
– Access to facilities
– Contention for tests and regional disasters
• The existence of a secondary data center
• Mainframe cost mitigated by Capacity BackUp (CBU) offering
• Open systems also need to be addressed, magnifying inflexibility
• Cost concerns for extended stay at recovery site
Support Recommendations
• Establish clear ownership of remote storage, to include:
– FlashCopy schedules
– Bandwidth utilization
– Global Mirror and Consistency Group validation and monitoring process
– Change control
Support Recommendations
• Volume-specific FlashCopy methods:
– Page datasets and coupling facility datasets should be replicated once to create an instance on the recovery platform and never replicated again.
– JES2 spool and secondary checkpoint should follow the same FlashCopy scheme as other production volumes.
– SYSRES volumes should be replicated once per month, following a successful IPL.
– Development and Test volumes should be replicated and FlashCopied once per day.
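One way to keep these per-volume rules unambiguous is to record them as a lookup table that a scheduler could consult. The sketch below is only an illustration of that idea: the volume class names, the policy labels, and the helper are hypothetical, none of them correspond to actual DS8300 settings, and "once per month" is approximated as 30 days.

```python
# Hypothetical volume classes and policy labels; not actual DS8300 settings.
FLASHCOPY_POLICY = {
    "page_datasets": "once",
    "coupling_facility_datasets": "once",
    "jes2_spool": "production_schedule",
    "jes2_secondary_checkpoint": "production_schedule",
    "sysres": "monthly_after_ipl",
    "dev_test": "daily",
}


def flashcopy_due(volume_class, days_since_last_copy, ipl_succeeded=False):
    """Decide whether a volume class is due for another FlashCopy.

    days_since_last_copy is None when the volume has never been copied.
    Volumes on the production schedule are handled by the consistency group
    process, so this helper always returns False for them.
    """
    policy = FLASHCOPY_POLICY[volume_class]
    if policy == "once":
        return days_since_last_copy is None
    if policy == "daily":
        return days_since_last_copy is None or days_since_last_copy >= 1
    if policy == "monthly_after_ipl":
        # Assumption: "once per month" is approximated as 30 days.
        overdue = days_since_last_copy is None or days_since_last_copy >= 30
        return overdue and ipl_succeeded
    return False


print(flashcopy_due("page_datasets", None), flashcopy_due("sysres", 45, ipl_succeeded=True))
```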
Support Recommendations
• Production volume consistency group creation schedule:
– During prime shift, create consistency groups at the system default interval of 1-2 minutes, to support CICS and DB2.
– Pause Global Mirror operation before the batch cycle and resume after the cycle, to support batch processing.
– For DR test exercises, perform FlashCopies to the DR test instance at a point in time that satisfies the test requirements.
– Plan to implement TPC for Replication for advanced capabilities.
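The pause-before and resume-after pattern for batch cycles can be sketched as a wrapper that guarantees the resume step runs even if the batch fails. The `pause` and `resume` arguments are hypothetical hooks standing in for whatever tooling actually controls the mirror session; no real Global Mirror command is modeled here.

```python
from contextlib import contextmanager


@contextmanager
def global_mirror_paused(pause, resume):
    """Pause replication for the duration of a batch cycle, then resume.

    pause and resume are caller-supplied callables (hypothetical hooks into
    whatever tooling actually controls the mirror session).
    """
    pause()
    try:
        yield
    finally:
        # Resume even if the batch cycle fails, so replication is never left paused.
        resume()


events = []
with global_mirror_paused(lambda: events.append("pause"),
                          lambda: events.append("resume")):
    events.append("batch cycle")
print(events)
```

Putting the resume in a `finally` clause is the design point: a failed batch job should not silently extend the recovery point exposure.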
Recovery Point Improvements
[Figure: Recovery Point in Time Comparison. X-axis: time of day, 13:00 through 11:00. Y-axis: recovery point in time, scale 0 to 35. Series: Before, After.]
Total Recovery Improvement
[Figure: Total Recovery Time to Point of Outage. X-axis: time of day, 13:00 through 11:00. Y-axis: RTO + RPO to point of outage, scale 0 to 80. Series: Before, After.]
Summary
• Shift to internal recovery model using:
– Storage replication
– Capacity backup mainframe
• Shift from “restore” to “restart”
• Significant improvement in recovery objectives
Core Efficiency Technologies
• Deduplication: saves up to 90% for full backups
• Snapshot: space-efficient imaging for data protection; saves over 70%
• Storage Virtualization: pooling and sharing heterogeneous storage resources
Deduplication
• Deduplication removes redundant data blocks from volumes, regardless of application or protocol
• With deduplication, users can recoup 50% or more of their capacity for many data sets and environments
• Deduplication can be used for primary, secondary, and archival storage tiers
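The idea of removing redundant blocks can be shown with a minimal sketch. It assumes fixed-size 4 KiB blocks identified by SHA-256 digests; those choices, and the function names, are assumptions for illustration and do not describe any particular product's method.

```python
import hashlib


def deduplicate(data, block_size=4096):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (store, recipe): store maps a SHA-256 digest to the block bytes,
    and recipe is the ordered list of digests needed to rebuild the data.
    """
    store, recipe = {}, []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original bytes from the stored blocks."""
    return b"".join(store[d] for d in recipe)


# Ten identical 4 KiB blocks followed by one different block.
data = b"A" * 4096 * 10 + b"B" * 4096
store, recipe = deduplicate(data)
stored_bytes = sum(len(b) for b in store.values())
print(len(data), stored_bytes, rebuild(store, recipe) == data)
```

How much capacity is recouped depends entirely on how repetitive the data is; this contrived input is highly redundant by construction.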
Snapshots
• Locally retained point-in-time copies of file systems that you can use to protect data
– Single files or complete backup and recovery
• Block-incremental behavior limits associated storage capacity consumption
• Reliable off-media backups without the need for long backup windows
• Simplifies the process of recovering, duplicating, or archiving data
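Block-incremental behavior can be illustrated with a toy copy-on-write volume: a snapshot holds nothing when it is created and only saves the old contents of blocks that are overwritten afterwards. The class is a hypothetical sketch, not a description of any vendor's snapshot implementation.

```python
class SnapshotVolume:
    """Toy volume whose snapshots store only blocks overwritten afterwards."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data, one entry per block
        self.snapshots = []          # each snapshot: {block index: old content}

    def snapshot(self):
        """Create a snapshot; it consumes no block storage until data changes."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, content):
        # Preserve the old content in every snapshot that has not saved it yet.
        for saved in self.snapshots:
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = content

    def read_snapshot(self, snap_id):
        """Return the volume contents as they were when the snapshot was taken."""
        saved = self.snapshots[snap_id]
        return [saved.get(i, block) for i, block in enumerate(self.blocks)]


vol = SnapshotVolume(["a", "b", "c", "d"])
snap = vol.snapshot()
vol.write(1, "B")
print(vol.read_snapshot(snap), vol.blocks, len(vol.snapshots[snap]))
```

After one block changes, the snapshot holds one saved block rather than a full copy of the volume, which is the capacity behavior the bullet describes.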
Storage Virtualization
• Designed to improve the flexibility and utilization of your storage resources
• Pools your storage volumes, files and file systems into a single reservoir for centralized management
• Works with heterogeneous storage systems
• Reduces the effects of hardware configurations and helps support business continuity
Where do you start?
• Understand your current resiliency/recovery capabilities, limitations, and risks (BIA, Risk Assessment)
• Develop Service Inventory
– Prioritize and define the value/importance to the business of each IT service, i.e., Tier 1, 2, or 3
• Define Data Protection Methods
• Review/Define Service Level Objectives/Agreements
• Document Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each IT service
• Design and build the resiliency model to support the business-supported IT services
• Develop Data Migration Plan
Develop Service Inventory
• Define tier-specific patterns for:
– Virtualization and cloud opportunities
– Daily operational quality of service
– High availability requirements
– Recovery requirements
– Data protection requirements
• Establish service placement in each tier
• Develop budgetary roll-up of all services by tier
• Perform service delivery chain mapping for critical systems
Sample IT Service Tiering
Tier definitions and target RTO/RPO for 2010:
• Tier 1 – Communications/Infrastructure: RTO <24 hours, RPO <12 hours
• Tier 2 – Critical Applications: RTO <72 hours, RPO <24 hours
• Tier 3 – Non-Critical Applications: RTO 7-10 days, RPO <24 hours
• Tier 4 – Development/Test: RTO 30 days, RPO <24 hours
Services by Tier:
Sharepoint (non-critical)
Network Phone System client websites (no SLA) TFS - Firewall - Call Manager QC - Telecom - Unity Oracle ERP - Router - IPCC XXX Apps - Switch SharePoint (intranet) Move-it DMZ - VPN - intranet.XXX.org Records Management - ACS client websites (w/ SLA) License Servers (Cached) Email Infra SmartFilter - Exchange License Servers (No Cache) DMS (2010) - Blackberry ProjectWise Phones SIP (2010) - SRST
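The sample tier targets above lend themselves to a simple lookup that a recovery test report could be checked against. In this sketch the targets are converted to hours, the 7-10 day range is represented by its upper bound, and limits are treated as inclusive; those simplifications, and the names used, are assumptions made for illustration.

```python
# Targets in hours, from the sample tiering above. Assumptions: ranges use
# their upper bound (7-10 days becomes 10 days) and limits are inclusive.
TIER_TARGETS = {
    1: {"name": "Communications/Infrastructure", "rto": 24, "rpo": 12},
    2: {"name": "Critical Applications", "rto": 72, "rpo": 24},
    3: {"name": "Non-Critical Applications", "rto": 10 * 24, "rpo": 24},
    4: {"name": "Development/Test", "rto": 30 * 24, "rpo": 24},
}


def within_targets(tier, measured_rto_hours, measured_rpo_hours):
    """Check a measured recovery result against the targets for a tier."""
    target = TIER_TARGETS[tier]
    return (measured_rto_hours <= target["rto"]
            and measured_rpo_hours <= target["rpo"])


# Hypothetical test results: (tier, measured RTO hours, measured RPO hours).
results = [(1, 20, 10), (2, 60, 30), (3, 200, 12)]
for tier, rto, rpo in results:
    print(TIER_TARGETS[tier]["name"], within_targets(tier, rto, rpo))
```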
Define Data Protection Patterns
• Develop an inventory of data types supporting complete service delivery
• Based on data protection requirements, determine best replication methods
• Develop estimated bandwidth and budgetary roll-up costs to support the preferred replication methods
• Define roadmap or incremental implementation approach for desired replication methods which are outside current approved funding
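A first-pass bandwidth estimate can be derived from the daily change volume and the hours available for replication. The sketch below uses decimal units and an optional headroom multiplier for overhead and bursts; the function name, the headroom factor, and the 500 GB sample figure are assumptions for illustration, not measurements from the case described here.

```python
def required_bandwidth_mbps(changed_gb_per_day, replication_hours=24.0, headroom=1.0):
    """Average link speed (megabits per second) needed to ship a day's changes.

    changed_gb_per_day -- data changed per day, in decimal gigabytes
    replication_hours  -- hours per day during which replication may run
    headroom           -- multiplier for protocol overhead and bursts (assumed)
    """
    megabits = changed_gb_per_day * 8 * 1000   # 1 GB = 8,000 megabits
    seconds = replication_hours * 3600
    return megabits / seconds * headroom


# Hypothetical workload: 500 GB of changed data per day.
print(round(required_bandwidth_mbps(500), 1))
print(round(required_bandwidth_mbps(500, replication_hours=8, headroom=1.3), 1))
```

An average hides peaks: asynchronous replication during a heavy batch window may need considerably more than the daily mean, so a measured change-rate profile is a better basis for the budgetary estimate.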
Data Migration Planning
• Discover characteristics of the source data
• Based on data protection requirements, determine data migration requirements
• Develop data migration plan
• Deliverables:
– Initial and incremental roadmap for replication implementation
Conclusion
• There is no one-size-fits-all approach.
• Most businesses do not fully understand how vulnerable they are using existing recovery processes.
• Companies that are looking to improve their RTO/RPO approach need to:
– Look carefully at their business requirements and risk
– Understand IT demands by understanding the applications, the operating system, storage, network bandwidth requirements, and the total business impact
• Relating the business support needs to IT Service Management can assist them in determining which approach is the most suitable.
– Sometimes a combination of elements drawn from several approaches works best.
– You can mix and match in order to tailor a solution that meets the unique needs of any business unit.