2006 Winter Conference, February 16th, 2006, Sheraton Braintree Hotel, Braintree, MA. Architecting Complex Business Continuity Solutions. Gavin Lewellin, Global Practice Manager, Replication and Data Migration, EMC Corporation. Synopsis: This topic covers the key high-level technical areas to consider when evaluating technology options and strategies for replicating data, including distance and network quality, data center placement, tiering and ILM, data federation and recovery groups. It also examines the most common best and worst practices applied when making critical decisions, and briefly discusses current trends and strategies.

NEDRIX-Architecting Complex Business Continuity Solutions

Page 1: NEDRIX-Architecting Complex Business Continuity Solutions

2005 Annual Conference, October 24-26 2005, Newport, RI

2006 Winter Conference, February 16th, 2006

Sheraton Braintree Hotel, Braintree, MA

Architecting Complex Business Continuity Solutions

Gavin Lewellin, Global Practice Manager, Replication and Data Migration

EMC Corporation

Page 2: NEDRIX-Architecting Complex Business Continuity Solutions

Presentation Agenda
• A Quick Technology Background
• General Technology Planning Best Practices
• Data Center Placement
• Distance and Network Quality
• Unlocking Latent Demand
• Data Federation and its Impact on Recovery/Restart
• Technology Evaluation Criteria
• Current Trends and Strategies

Page 3: NEDRIX-Architecting Complex Business Continuity Solutions

DR Solution Spectrum

[Diagram: a spectrum of DR solutions for a database after a disaster, running from Recovery through Restart to Running.]
• Storage based: Traditional DR (Physical & Logical Recovery), Point-In-Time Copies, Asynchronous, Synchronous, In-System (Local) Replication
• DBMS based: Log Shipping, Hot Stand-By (e.g. Oracle Data Guard), Database Replication Techniques (Oracle Streams and Adv Replication)
• Hybrid Storage and Database Solution

Page 4: NEDRIX-Architecting Complex Business Continuity Solutions

Common Data Replication Modes
• Synchronous Replication
• Asynchronous Replication
• Point-In-Time Copies (PITC)
• Three Data Center Strategies
• Log Shipping
• “Hybrid” Solutions

Page 5: NEDRIX-Architecting Complex Business Continuity Solutions

Synchronous and Asynchronous

Synchronous Replication
• No data exposure
• Some performance impact
• Limited distance
[Diagram: Source to Target over a Limited Distance]

Asynchronous Replication
• Predictable RPO
• No performance impact
• Unlimited distance
[Diagram: Source to Target over an Unlimited Distance]

Page 6: NEDRIX-Architecting Complex Business Continuity Solutions

Point-In-Time Copies

Point-In-Time Copies (1)
• Predictable RPO (Zero – Hours)
• Some performance impact
• Unlimited distance
[Diagram: Prod and Bunker; Source to Target over an Unlimited Distance]

Point-In-Time Copies (2)
• Predictable RPO (Hours)
• No performance impact
• Unlimited distance
[Diagram: Prod; Source to Target over an Unlimited Distance]

Page 7: NEDRIX-Architecting Complex Business Continuity Solutions

Three Data Center Strategies

[Diagram labels: Primary Site (Source); Bunker Site (Target), Sync link, zero time lag RPO; Long Distance Site (Target'), Async link, seconds-to-minutes time lag RPO; a further Async link.]

Result:
• Bunker Site and Long Distance Site can be incrementally synchronized to most recent data
• Third link maintains continuous protection

Page 8: NEDRIX-Architecting Complex Business Continuity Solutions

Log Shipping

1. Single database instance: synchronous full one-time population
2. Redo logs copied to archive logs
3. Archive logs manually copied to remote site over the network
4. Manually apply the archive logs on the remote site to create a database point of consistency

[Diagram: Data Files, Other Data, Redo Logs and Archive Logs on one side; Data Files, Other Data and Archive Logs on the other; arrows numbered 1 to 4 match the steps above.]

Page 9: NEDRIX-Architecting Complex Business Continuity Solutions

Log Shipping (No data loss)

1. Single database instance: synchronous full one-time population
2. Redo logs remain in sync mode
3. Redo logs copied to archive logs at log switch
4. Archive logs copied to remote site over the network
5. Manually apply the archive logs on the remote site to create a database point of consistency

[Diagram: Data Files, Other Data, Redo Logs and Archive Logs on both sides; numbered arrows match the steps above.]

Page 10: NEDRIX-Architecting Complex Business Continuity Solutions

“Hybrid” Storage and DBMS Replication

Sybase Mirror Activator addresses the gap created by using either storage replication or transaction replication alone.
– Works in conjunction with storage replication vendors to provide a live standby DBMS with guaranteed transactional integrity
– Extensive testing with EMC including both SRDF/S and SRDF/A (white paper available)
– Works with EMC SRDF, IBM PPRC, Veritas Volume Replicator, NetApp SnapMirror, and Hitachi TrueCopy

[Diagram: Primary Site and Disaster Recovery Site, each with Web Server, App Server, DBMS, File System and Storage. Labels: Data Systems DBMS, Mirror Activator DBMS, Block Replicator.]

Page 11: NEDRIX-Architecting Complex Business Continuity Solutions

Business Continuity Program Lifecycle

[Diagram: lifecycle with the phases Start-up & Preparation, Plan, Build and Manage, under Program Management and Integration.]
• Start-up & Preparation: Project Planning (A), Profile Environment (B)
• Activities shown: Assess Program/Service Levels (1), Review Related Information Protection Programs (1.a), Define Business Requirements, Evaluate Availability and Recovery Alternatives (3), Design Infrastructure, Conduct Implementation Planning (5), Test and Implement Technologies, Develop Recovery/Failover Plans (7), Conduct Recovery Testing, Develop/Update Program Definition (9), Manage Resources, Improvements & Measurement (10)

Page 12: NEDRIX-Architecting Complex Business Continuity Solutions

General Architecture Best Practices

• What are the differences between Estimation, Modeling and Simulation?
– It’s all in the planning. Ask for deliverables with clearly defined caveats, even if it is via a paid engagement.

Design Category: ESTIMATION, MODELING, SIMULATION
• Design Project Cost: Free - $1M
• Time to complete: 2 days – 6 months+
• Accuracy of Analysis: 40% - 90%

Page 13: NEDRIX-Architecting Complex Business Continuity Solutions

Things we hear!

• “Yes, last weekend was my yearly processing peak.”

• “Oh, we maybe do about 5,000 – 15,000 I/O per second.”

• “We are only going to see 20% growth this year.”

• “We have no network in place. Can you tell us if this solution will work?”

• “We are going to completely re-architect the application(s) 1 week before we want this DR project to go live.”

• “We bought an OC-3 between Singapore and London. Can you make your solution fit?”

• “We have suddenly got $5 million in our budget that will disappear by the end of this week. We’d like to look at DR.”

Page 14: NEDRIX-Architecting Complex Business Continuity Solutions

What is the maximum Synchronous distance?

Best Practices
• Check the mathematics BEFORE you crack out the trowel and start building your second data center.
• Use simulators and models to determine performance impacts.

Worst Practices
• Build the second data center too far away and then try to squash the workload into distance-induced performance problems.
• Determine synchronous performance problems by “sticking it into production and seeing what happens”.

Page 15: NEDRIX-Architecting Complex Business Continuity Solutions

Determining Synchronous Distance
• It depends (of course)
– Write distribution (skew)
– Current response time
– Projected response time
– Write access density

[Chart: WRITES/SECOND. Writes per second (y-axis, 0 to 350) for each volume (x-axis).]

Example: if the current write response time is 2 ms, the maximum write rate to a single volume is 500 writes per second (1 second / 2 ms). Introducing an additional 2 milliseconds of latency due to synchronous overheads reduces this to 250 writes per second per volume.
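
The arithmetic in the example on this slide can be sketched in a few lines. This is a minimal illustration, assuming writes to a single volume are issued one at a time so that response time alone bounds the write rate; the function names are hypothetical and not part of any product.

```python
def max_writes_per_second(response_time_ms: float) -> float:
    """Upper bound on writes per second to one volume, assuming
    each write must complete before the next one is issued."""
    if response_time_ms <= 0:
        raise ValueError("response time must be positive")
    return 1000.0 / response_time_ms


def writes_with_sync_overhead(current_ms: float, added_ms: float) -> float:
    """Write ceiling once synchronous replication adds latency to every write."""
    return max_writes_per_second(current_ms + added_ms)


# Figures from the slide: 2 ms today, plus 2 ms of synchronous overhead.
print(max_writes_per_second(2.0))           # 500.0
print(writes_with_sync_overhead(2.0, 2.0))  # 250.0
```

Comparing the second figure with the busiest volumes in the writes-per-second chart shows which volumes, if any, would be throttled by the added latency.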

Page 16: NEDRIX-Architecting Complex Business Continuity Solutions

Synchronous Distance – Case Study
• Financial institution
• Has 3 sites in place
• Requires 4th site as a hub for replication
• Where should the site go?
– $M’s at risk
– What is the lowest latency network available?
– Synchronous required (zero data loss)
– Guess could be career altering

[Diagram: sites DC1, DC2 and DC3, with a location marked X.]

Page 17: NEDRIX-Architecting Complex Business Continuity Solutions

Asynchronous: Flow control problem
• All asynchronous replication techniques use some form of buffering.
• The amount of buffering depends on the data arrival rate and the bandwidth available.
• If there is insufficient bandwidth, ultimately, the buffer will overflow.
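
The flow control problem can be illustrated with a toy model. The sketch below assumes a simple one-second time step, a fixed link drain rate and a fixed buffer size; the function name and the workload figures are hypothetical, and real replication products manage their buffers in more elaborate ways.

```python
def simulate_buffer(arrivals_mb, link_mb_per_s, buffer_mb):
    """Track the replication backlog second by second.

    arrivals_mb: MB of new writes arriving in each one-second interval.
    Returns (peak_backlog_mb, overflow_second), where overflow_second is
    None if the backlog never exceeds the buffer.
    """
    backlog = 0.0
    peak = 0.0
    for second, arrived in enumerate(arrivals_mb):
        # Data arrives, the link drains what it can, backlog cannot go negative.
        backlog = max(0.0, backlog + arrived - link_mb_per_s)
        peak = max(peak, backlog)
        if backlog > buffer_mb:
            return peak, second
    return peak, None


# Hypothetical burst: 40 MB/s for three seconds against a 20 MB/s link.
burst = [10, 10, 40, 40, 40, 10, 10]
print(simulate_buffer(burst, link_mb_per_s=20, buffer_mb=50))   # (60.0, 4)
print(simulate_buffer(burst, link_mb_per_s=20, buffer_mb=100))  # (60.0, None)
```

A larger buffer rides out the burst, but if the average arrival rate stays above the link rate, no finite buffer is enough.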

Page 18: NEDRIX-Architecting Complex Business Continuity Solutions

Asynchronous: Best and Worst

Best Practices
• Measure workload (a)
• Measure average transfer size (b)
• Determine flow rate (a * b)
• Divide by compression ratio
• Test link quality
• Understand workload fluctuations

Worst Practices
• Don’t measure anything
• Don’t test anything
• Buy the biggest box you can (or the smallest)
• “Max out the cache” (buy the maximum amount of cache)
• Run it on a T1
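
The sizing steps in the best-practice list (flow rate = a * b, divided by the compression ratio) can be written out as follows. This is a sketch under stated assumptions: the workload figures are invented for illustration, the function name is hypothetical, and 1 KB is taken as 1000 bytes. The T1 and OC-3 line rates are the standard nominal values.

```python
T1_MBPS = 1.544    # nominal T1 line rate, Mbit/s
OC3_MBPS = 155.52  # nominal OC-3 line rate, Mbit/s


def required_bandwidth_mbps(writes_per_sec, avg_transfer_kb, compression_ratio=1.0):
    """Flow rate (a * b) divided by the compression ratio, in Mbit/s."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    kb_per_sec = writes_per_sec * avg_transfer_kb  # a * b
    compressed_kb_per_sec = kb_per_sec / compression_ratio
    return compressed_kb_per_sec * 8 / 1000        # KB/s to Mbit/s


# Hypothetical workload: 2,000 writes/s of 8 KB each, 2:1 compression.
needed = required_bandwidth_mbps(2000, 8, compression_ratio=2.0)
print(needed)             # 64.0
print(needed > T1_MBPS)   # True: far more than a T1 can carry
print(needed < OC3_MBPS)  # True: within OC-3 line rate, before protocol overhead
```

The result is an average; link quality and workload fluctuations, the last two best-practice bullets, still have to be measured separately.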

Page 19: NEDRIX-Architecting Complex Business Continuity Solutions

Gotchas for Three Data Center Strategies

Best Practices
• Design the Synchronous and Asynchronous legs independently.
• Be able to recover from EITHER the synchronous or asynchronous data center depending on which is the most recent.

Worst Practices
• Make the synchronous mistakes
– Put the primary and secondary data centers too far apart
– Find out it doesn’t work after “sticking it into production”
• Make the asynchronous mistakes
– Don’t measure/test anything
– Buy the biggest boxes you can
– Put in maximum cache
– Run it through a T1

Page 20: NEDRIX-Architecting Complex Business Continuity Solutions

Quality of Networks
• Validating network characteristics is exceptionally critical.
• Bandwidth and Throughput are rarely the same value.
• Latency and Physical Circuit Distance are never the same.
• IP networks are especially prone to packet loss and latency.
• 1 millisecond of round-trip latency is equal to (best case) 100 km of circuit distance.

[Chart: Throughput (y-axis) against Latency/Packet Loss (x-axis).]
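
The slide's rule of thumb (1 ms of latency per 100 km of circuit, best case) gives a quick lower bound on the latency a candidate site adds. The sketch below is illustrative only: the function name and the `route_factor` parameter are hypothetical, the latter standing in for the fact that the physical circuit is longer than the straight-line distance between sites.

```python
KM_PER_MS = 100.0  # the slide's best-case rule of thumb: 1 ms per 100 km of circuit


def best_case_latency_ms(site_distance_km, route_factor=1.0):
    """Lower bound on added latency for a link between two sites.

    route_factor is an assumed multiplier (>= 1) converting straight-line
    distance into circuit distance; it must come from the carrier in practice.
    """
    if route_factor < 1.0:
        raise ValueError("a circuit cannot be shorter than the straight line")
    return site_distance_km * route_factor / KM_PER_MS


print(best_case_latency_ms(200))                    # 2.0
print(best_case_latency_ms(200, route_factor=1.5))  # 3.0
```

Because this is a best case, measured latency on the real circuit should replace the estimate as soon as the link can be tested.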

Page 21: NEDRIX-Architecting Complex Business Continuity Solutions

Improving Network Quality
• Set SLAs with bandwidth carriers around throughput, packet loss and latency.
• Examine Network Acceleration/Fast Write capabilities (e.g. NetEx).
• Look at running a “Proof of Concept” to validate network improvements.
• Network Acceleration assists with poorly performing networks.

[Chart: Throughput (y-axis) against Latency/Packet Loss (x-axis).]

Page 22: NEDRIX-Architecting Complex Business Continuity Solutions

Unlocking Latent Demand
• Switching from Synchronous operations to Asynchronous operations.
• Technology refreshes that enhance application/database performance.
• Assessing latent demand is an exceptionally difficult task.
• You should always periodically measure workloads to assess whether workload changes will impact your DR capability.

[Chart: Network Throughput Required (Normal Week). Megabytes/sec (y-axis, 0 to 100) against time of day across the week (x-axis). Series: Current Workload, Latent Demand.]

Page 23: NEDRIX-Architecting Complex Business Continuity Solutions

Database Federation and Tiering
• How interdependent are your applications and databases?
• Understanding these dependencies will drive how you alter your tiering and will define new classes of “recovery groups”.
• Changes to your tiering and recovery groups will alter the way you need to think about strategically dispersing your production processing.
– How federated are my apps?
– Tightly coupled or loosely coupled?
– Can “billing” tolerate latency insertion on every transaction?

[Diagram: systems grouped into TIER 1, TIER 2 and TIER 3. Tier 1: Oracle, Sybase, IMS, z/OS DB2. Tier 2: SAP, Oracle, Oracle. Tier 3: Exchange, SAP.]
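
One way to picture how dependencies define recovery groups is to treat applications as nodes and dependencies as links, then collect everything that is connected into one group. This is only a sketch of that idea, not a method given in the presentation: the application names, the dependency list, and the rule that a group inherits the most critical tier of its members are all assumptions made for illustration.

```python
from collections import defaultdict


def recovery_groups(tiers, dependencies):
    """Group interdependent applications so they can be restarted together.

    tiers: dict mapping application name to tier number (1 = most critical).
    dependencies: iterable of (app_a, app_b) pairs that must stay consistent.
    Returns a sorted list of (group_tier, [member names]).
    """
    graph = defaultdict(set)
    for a, b in dependencies:
        if a not in tiers or b not in tiers:
            raise KeyError(f"dependency refers to an unknown application: {a}, {b}")
        graph[a].add(b)
        graph[b].add(a)

    seen, groups = set(), []
    for app in tiers:
        if app in seen:
            continue
        # Depth-first walk collects every application reachable from this one.
        stack, members = [app], []
        seen.add(app)
        while stack:
            current = stack.pop()
            members.append(current)
            for neighbour in graph[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        # Assumption: the group takes the most critical tier among its members.
        groups.append((min(tiers[m] for m in members), sorted(members)))
    return sorted(groups)


# Hypothetical applications and dependencies.
tiers = {"billing": 1, "orders": 2, "email": 3, "reporting": 3}
deps = [("billing", "orders")]
print(recovery_groups(tiers, deps))
# [(1, ['billing', 'orders']), (3, ['email']), (3, ['reporting'])]
```

In this example "orders" is pulled up from tier 2 to tier 1 because "billing" depends on it, which is the kind of tiering change the slide describes.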

Page 24: NEDRIX-Architecting Complex Business Continuity Solutions

Geographic Dispersal of Production Workloads
• How do you best utilize “idle” recovery assets?

• Clearly define your recovery groups and tiers.
– It’s not an all-or-nothing approach.

• Understand app/database tolerances for running active/active over distance.

• Test segregation of workloads regularly.

• Examine “hybrid” solutions as a way to perform secondary tasks at secondary or tertiary data centers.

• Embed DR planning in every aspect of change control.

[Diagram: Primary, Secondary and Tertiary data centers.]

Page 25: NEDRIX-Architecting Complex Business Continuity Solutions

Technology Evaluation Criteria

• Develop assessment criteria to which you can assign weighting based on your needs.

• Categorize by Solution Architecture Type:
– DBMS, Log Shipping, Hybrid Solutions, Array based (Sync, Async, PITC), Server, Switch/Network, File, Volume/LUN

• Site Strategies Supported:
– Single, Dual, Three+, In/Out of Region?

• Disaster Recovery and Operational Recovery/Resumption Point and Time Objectives:
– Define the difference between High Availability and Disaster Recovery.

• Architecture Scalability:
– How many hosts? How much storage? What are the performance characteristics?
– Do you have reference architectures? What is your install base?
– Assess product/solution maturity.

Page 26: NEDRIX-Architecting Complex Business Continuity Solutions

Technology Evaluation Criteria (cont’d)

• Deployment and Operational Complexity:
– Is there a migration plan to get me from A to B?
– How will I manage this infrastructure?
– Are there any monitoring or automation capabilities?

• Product/Solution Functionality Pros and Cons:
– Despite a leveling of the playing field, this is still a critical evaluation criterion.
– RAID flexibility, local and remote replication capabilities, integration with clustering technologies, etc.

• Relative Cost:
– Storage, Server, Bandwidth, Network.
– Operational cost (staffing for new assets and business processes).

Page 27: NEDRIX-Architecting Complex Business Continuity Solutions

Technology Evaluation Criteria (cont’d)

• Failover/Failback capabilities:
– What manual intervention is required?
– Is this a full re-sync or incremental?
– How does the solution handle error injection (link drops, etc.)?

• Database Federation:
– Does the solution support interdependent databases/applications to enable cross-platform restart (e.g. CICS on Mainframe coupled with Oracle)?
– Can I manage multiple Replication Groups/Tiers with this solution?
– What if my recovery groups change?

• Supported transmission/network protocols:
– Fibre Channel, FICON, Gig-E, 10GIG, SONET, iSCSI, etc.
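
The evaluation criteria on the preceding slides are meant to be weighted according to need. A weighted average is one common way to do that; the sketch below assumes this scheme, and the criterion names, weights, scores and candidate labels are all invented for illustration.

```python
def weighted_score(weights, scores):
    """Weighted average of per-criterion scores.

    weights: dict of criterion -> importance (any positive scale).
    scores:  dict of criterion -> score for one candidate solution.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score given for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


# Hypothetical weights (cost matters most) and 1-5 scores for two candidates.
weights = {"scalability": 3, "complexity": 2, "cost": 5}
candidate_a = {"scalability": 4, "complexity": 3, "cost": 2}
candidate_b = {"scalability": 3, "complexity": 4, "cost": 4}

print(round(weighted_score(weights, candidate_a), 2))  # 2.8
print(round(weighted_score(weights, candidate_b), 2))  # 3.7
```

Changing the weights changes the ranking, which is the point: the weighting should be agreed before vendors are scored.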

Page 28: NEDRIX-Architecting Complex Business Continuity Solutions

Evaluation Criteria – RPO/RTO

                 HA Objectives              DR Objectives
Business Unit    ORPO     ORTO              DRPO             DRTO
ABC              No HA    No HA             12, 8, 4 hrs     12, 8, 4 hrs
GEH              0 min    15, 5, 0 min      15, 5, 0 min     12, 2 hrs
DEF              0 min    15, 5, 0 min      60, 30, 0 min    12, 4, 1 hrs

• Risk Mitigation Requirements (for two or three data centers)
– At least one Data Center must be “out-of-region”.
– At least one Data Center must not be in a major metropolitan area.
– Split GEH production systems between “in-region” Data Centers.
– Split ABC production systems between “out-of-region” Data Centers.

Page 29: NEDRIX-Architecting Complex Business Continuity Solutions

Evaluation Requirements – Site Solutions
• Site scenarios to be analyzed...

[Table: Two Site and Three Site solutions. Columns: Site Solutions, Site Dynamic (Site 1, Site 2, Site 3), Replication Consideration. Site values: Active, Passive, Bunker, n/a. Replication values: One Way, Bi-Directional.]

Page 30: NEDRIX-Architecting Complex Business Continuity Solutions

Creating a Context
• Different industries require different strategies

[Diagram: industries placed against data center strategy (Single Data Center, Dual Data Center, Triple Data Center), failsafe level (Low Failsafe, High Failsafe, Transparent Failsafe), security (Low security, High Security) and volume (Low Volume, High Volume). Other labels: 100% Procedural (0% IT Architectural Redundancy), Manual, 24 hrs x 7 days, Resources. Industries shown: Non Critical Business, Small Industries; Essential Services (Utilities, Airlines, Hospital); Banks, Financial Services; Telecommunications; Food Manufacturer; Consumer Goods Manufacturing; Manufacturing; Retail & Online; Transportation, Logistics.]

Page 31: NEDRIX-Architecting Complex Business Continuity Solutions

So... What is your organization doing?

[Matrix: Data Center Strategy (Single, Dual, Triple) against Disaster Type (Regional, Entire Data Center, Local, Component Level). Organizations marked with an X: Wall Street, Large Insurance, SMB, Large Manufacturing, Health, Largest Banks.]

Page 32: NEDRIX-Architecting Complex Business Continuity Solutions

Technology Trends

• The Move to Automation and Monitoring
– Automated Restart Operations.
– Integration with Management Frameworks (Tivoli, OpenView, etc.).
– End-to-end monitoring to pinpoint and diagnose potential problems.

• Three Data Center Strategies
– Provide local High Availability and extended-distance disaster recovery/restart.
– Lower cost variants are on the way.

• Geographic Dispersal of Production Workloads
– Recovery Group management.
– Frequent cycling of workloads across data centers
  • Enables more realistic DR testing.
– Impact of Virtualization TBD

Page 33: NEDRIX-Architecting Complex Business Continuity Solutions

Thank You