Architect’s Guide to Designing Integrated Multi-Product HA-DR-BC Solutions. John Sing, Executive Strategy, IBM. Session E10

2012_Architects_Guide_Designing_Integrated_Multi-Product_HA_DR_BC_Solutions_v2


DESCRIPTION

In today's sophisticated IT Cloud world, how do I fuse multiple technologies, products, and clouds together to create a 2012 integrated High Availability, Disaster Recovery, Business Continuity IT solution? This session complements product-specific and overview HA/DR/BC sessions by providing a proven, product-agnostic methodology to architect such a solution, including petabyte-level considerations. We provide a pragmatic, industry-proven, step-by-step methodology and toolset for you to use to work directly with clients to a) crisply elicit and distill HA/DR/BC requirements, b) efficiently organize and map those requirements, c) design an integrated, multi-product, phased-approach IT HA/DR/BC solution which properly combines backup/restore software, tape, tape libraries, de-dup, point-in-time and continuous disk replication, and storage virtualization products, and d) provide a template to clearly communicate the solution and gain consensus across multiple levels of operations and management. John Sing is the author of 3 IBM Redbooks, including SG24-6547-03 IBM System Storage Planning for Business Continuity. My only request when referencing this material in your work is that you give full credit to me, John Sing, and IBM, as the authors of this material, research, and methodology. That having been said, please spread the good word.


Page 1: 2012_Architects_Guide_Designing_Integrated_Multi-Product_HA_DR_BC_Solutions_v2


Architect’s Guide to Designing Integrated Multi-Product HA-DR-BC Solutions. John Sing, Executive Strategy, IBM. Session E10

Page 2

John Sing • 31 years of experience with IBM in high end servers, storage, and software

– 2009 - Present: IBM Executive Strategy Consultant: IT Strategy and Planning, Enterprise Large Scale Storage, Internet Scale Workloads and Data Center Design, Big Data Analytics, HA/DR/BC

– 2002-2008: IBM IT Data Center Strategy, Large Scale Systems, Business Continuity, HA/DR/BC, IBM Storage

– 1998-2001: IBM Storage Subsystems Group - Enterprise Storage Server Marketing Manager, Planner for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror)

– 1994-1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage

– 1989-1994: IBM USA Systems Center Specialist for High-End S/390 processors

– 1982-1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA)

[email protected]

• IBM colleagues may access my webpage: http://snjgsa.ibm.com/~singj/

• You may follow my daily IT research blog: http://www.delicious.com/atsf_arizona

Page 3

Agenda

• Understand today’s challenges and best practices

– for IT High Availability and IT Business Continuity

• What has changed? What is the same?

• Strategies for:
– Requirements, design, implementation

• Step by step approach
– Essential role of automation
– Accommodating petabyte scale
– Exploiting Cloud


2012 Cloud deployment options

Page 4

Agenda

1. Solving Today’s HA-DR-BC Challenges

2. Guiding HA-DR-BC Principles to mitigate chaos

3. Traditional Workloads vs. Internet Scale Workloads

4. Master Vision and Best Practices Methodology

Page 5

Recovering today’s real-time massive streaming workflows is challenging

Chart in public domain: IEEE Massive File Storage presentation, author: Bill Kramer, NCSA: http://storageconference.org/2010/Presentations/MSST/1.Kramer.pdf:


Page 6

Today’s Data and Data Recovery Conundrum:

Page 7

Many options, including many non-traditional alternatives for user deployments, workload hosting, and recovery models

Traditional alternatives:

• Other platforms

• Other vendors

• Non-traditional alternatives: – The Cloud, the Developing World

Illustrative Cloud examples only; no endorsement is implied or expressed.

Inter-Disciplinary

Page 8

Finally, we have this ‘little’ problem regarding Mobile proliferation

• From an IT standpoint, we are clearly seeing the “consumerization of IT”

• Key is to recognize and exploit the hyper-paced reality of BYOD’s associated data

• Not just the technology

• Also the recovery model (“cloud”), the business model, and the required ecosystem

Clayton ChristensenHarvard Business School

http://en.wikipedia.org/wiki/Disruptive_innovation

Page 9

So how do we affordably architect HA / BC / DR in 2012?

Page 10

What has remained the same?

Data Protection Service Management Storage Efficiency

(Continued good Guiding Principles that mitigate HA/DR/BC chaos)

Page 11

(Chart: business processes A-G mapped across Business, Application, and Infrastructure layers; Applications 1-3, Analytics, management reports, http://xyz.xml, decision point, MQSeries, WebSphere, SQL, DB2)

1. An error occurs on a storage device that correspondingly corrupts a database

2. The error impacts the ability of two or more applications to share critical data

3. The loss of both applications affects two distinctly different business processes

IT Business Continuity must recover at the business process level

The Business Process is still the Recoverable Unit

Page 12

(Chart: the business-process-to-infrastructure mapping from Page 11, now with part of the stack hosted in the Cloud)

1. Data input to the cloud

2. Cloud provider outage

3. The loss of Cloud output affects two distinctly different business processes

Cloud is simply another deployment option

But it doesn’t change the fundamental HA/BC approach

Cloud does not change the business process; it is still the recoverable unit

Page 13

When can Cloud recovery provide extremely fast time to project completion?

• Where entire business process recoverable units can be out-sourced to a Cloud provider

– Production example: out-sourcing production, backup/restore, or an integrated, standalone application to a provider

– Cloud application-as-a-service (AaaS) example: Salesforce.com, etc.

(Chart: business processes A-G mapped across Business, Application, and Technical layers; Applications 1-3, Analytics, management reports, http://xyz.xml, decision point, MQSeries, WebSphere, SQL, DB2)

Page 14

The trick to leveraging Cloud is:

Understanding that Cloud is simply another (albeit powerful) deployment choice

Good news:

Fundamental principles for HA/DR/BC haven’t changed

It’s only the deployment options that have changed

Page 15

Still true: synergistic overlap of valid data protection techniques

• Protection of critical business data; operations continue after a disaster
• Costs are predictable and manageable; recovery is predictable and reliable

1. High Availability: fault-tolerant, failure-resistant, streamlined infrastructure with an affordable cost foundation

2. Continuous Operations: non-disruptive backups and system maintenance coupled with continuous availability of applications

3. Disaster Recovery: protection against unplanned outages such as disasters through reliable, predictable recovery

IT Data Protection

Page 16

Four Stages of Data Center Efficiency: (pre-req’s for HA/BC/DR)

http://public.dhe.ibm.com/common/ssi/ecm/en/rlw03007usen/RLW03007USEN.PDF http://www-935.ibm.com/services/us/igs/smarterdatacenter.html

April 2012

Page 17

Still true: Timeline of an IT Recovery

(Chart: recovery timeline from “Outage!” to “Now we're done!”)

• Recovery Point Objective (RPO): how much data must be recreated?

• Operations and network staff recover physical facilities, telecom network, and management control, then execute hardware, operating system, and data integrity recovery: the Recovery Time Objective (RTO) of hardware data integrity

• Applications staff then perform application transaction integrity recovery: the Recovery Time Objective (RTO) of transaction integrity, returning to production

• Telecom bandwidth is still the major delimiter for any fast recovery
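The chart's two objectives reduce to simple arithmetic. A minimal Python sketch (the phase names and hour figures are illustrative examples, not values from the chart):

```python
# Illustrative only: RPO is bounded by how stale the last good copy is;
# RTO is the sum of the recovery phases on the timeline above.

def rpo_hours(replication_interval_hours: float) -> float:
    """Worst case: the outage hits just before the next copy is taken."""
    return replication_interval_hours

def rto_hours(phase_hours: dict) -> float:
    """Total elapsed time from 'Outage!' to 'Now we're done!'."""
    return sum(phase_hours.values())

# Example: nightly backup, phased manual recovery (invented numbers)
phases = {
    "assess_and_declare": 2.0,
    "facilities_network_os_data": 6.0,  # RTO of hardware data integrity
    "application_transaction": 4.0,     # RTO of transaction integrity
}
print(rpo_hours(24))      # up to 24 hours of data to recreate
print(rto_hours(phases))  # 12.0 hours to restored production
```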

Page 18

Still true: value of Automation for real-time failover

(Chart: the Page 17 recovery timeline, compressed end to end by automating hardware recovery and transaction recovery)

• Recovery Point Objective (RPO): how much data must be recreated?

• Value of automation: reliability, repeatability, scalability, frequent testing

Page 19

Still true: Organize High Availability, Business Continuity Technologies, balancing recovery time objective with cost / value

Recovery Time Objective (guidelines only): 15 min, 1-4 hr, 4-8 hr, 8-12 hr, 12-16 hr, 24 hr, days; cost / value rises as RTO shrinks. Recovery from a disk image vs. recovery from tape copy.

BC Tier 1 – Restore from Tape
BC Tier 2 – Tape libraries + Automation
BC Tier 3 – VTL, Data De-Dup, Remote vault
BC Tier 4 – Add Point in Time replication to Backup/Restore
BC Tier 5 – Add Application/database integration to Backup/Restore
BC Tier 6 – Add real-time continuous data replication, server or storage
BC Tier 7 – Add Server or Storage replication with end-to-end automated server recovery
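The chart's selection logic (lowest-cost tier that still meets the RTO target) can be sketched in a few lines. The tier-to-RTO hours below are rough illustrative readings of the chart axis, not product claims:

```python
# Hedged sketch: pick the lowest-numbered (lowest-cost) BC tier whose
# typical best-case RTO still meets the target. RTO hours are invented.
BC_TIERS = [  # (tier, description, typical_best_rto_hours)
    (7, "Server/storage replication + end-to-end automated recovery", 0.25),
    (6, "Real-time continuous data replication", 1),
    (5, "Application/database integration added to backup/restore", 4),
    (4, "Point-in-time replication added to backup/restore", 8),
    (3, "VTL, de-dup, remote vault", 12),
    (2, "Tape libraries + automation", 24),
    (1, "Restore from tape", 72),
]

def cheapest_tier_for(rto_target_hours: float):
    """Lowest tier number (lowest cost) that can still meet the RTO target."""
    candidates = [t for t in BC_TIERS if t[2] <= rto_target_hours]
    return min(candidates, key=lambda t: t[0]) if candidates else BC_TIERS[0]

tier, desc, rto = cheapest_tier_for(6)  # a 6-hour RTO target
print(tier, desc)  # prints: 5 Application/database integration added to backup/restore
```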

Page 20

Still true: Replication Technology Drives RPO

For example (recovery point axis, from weeks down to seconds):

• Tape backup: recovery point of days to weeks
• Periodic replication: recovery point of hours
• Asynchronous replication: recovery point of seconds to minutes
• Synchronous replication / HA: recovery point near zero
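The chart's message is a lookup: the replication technology you choose bounds the RPO you can promise. A small sketch (band labels are illustrative readings of the Secs..Wks axis, not vendor figures):

```python
# Illustrative mapping from replication technology to the RPO band it bounds.
def rpo_band(technology: str) -> str:
    bands = {
        "synchronous": "seconds (near zero data loss)",
        "asynchronous": "seconds to minutes",
        "periodic": "minutes to hours",
        "tape backup": "hours to days (the backup interval)",
    }
    return bands.get(technology.lower(), "unknown technology")

print(rpo_band("asynchronous"))  # seconds to minutes
```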

Page 21

Recovery Time includes:

– Fault detection

– Recovering data

– Bringing applications back online

– Network access

Still true: Recovery Automation Drives Recovery Time

For example (recovery time axis, from weeks down to seconds):

• Manual tape restore: recovery time of days to weeks
• Storage automation: recovery time of hours
• End-to-end automated clustering: recovery time of seconds to minutes
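The four recovery-time components listed above can be summed to show where automation pays off. A hedged sketch; the minute figures are invented examples, not benchmarks:

```python
# Automation mainly removes human-paced steps from each recovery component.
# The four components mirror the bullets above; all minutes are illustrative.
def recovery_time_minutes(automated: bool) -> int:
    fault_detection = 1 if automated else 30
    data_recovery   = 20 if automated else 120
    app_restart     = 10 if automated else 60
    network_access  = 1 if automated else 30
    return fault_detection + data_recovery + app_restart + network_access

print(recovery_time_minutes(False))  # 240 (manual)
print(recovery_time_minutes(True))   # 32 (end-to-end automated)
```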

Page 22

(Chart: IBM Business Continuity program lifecycle: Business Prioritization, Strategy Design, Program Design, Implement, Manage / Integration into IT)

• Risk assessment: risks, vulnerabilities and threats
• Business impact analysis: impacts of outage, RTO/RPO
• Program assessment: current capability; maturity model, measured ROI, roadmap for program
• Program validation: estimated recovery time
• Resilience Program Management: awareness, regular validation, change management, quarterly management briefings

Business processes drive strategies and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.

• Strategy layers: crisis team, business resumption, disaster recovery, high availability
• Covering: 1. People 2. Processes 3. Plans 4. Strategies 5. Networks 6. Platforms 7. Facilities
• High availability design: database and software design, high availability servers, storage and data replication

Source: IBM STG, IBM Global Services

Still true: “ideal world” construct for IT High Availability and Business Continuity

Page 23

The 2012 Bottom line: (IT Business Continuity Planning Steps)

For today’s real-world environment:

(Chart: the same “ideal world” BC planning lifecycle as Page 22)

i.e. how to streamline this “ideal” process?
1. Collect information for prioritization
2. Vulnerability, risk assessment, scope
3. Define BC targets based on scope
4. Solution option design and evaluation
5. Recommend solutions and products
6. Recommend strategy and roadmap

2012 key #1: need a basic Data Strategy
2012 key #2: Workload type

We need a faster way than even this simplified 2007 version:

Page 24

Streamlined BC Actions (Input → Output)

1. Collect info for prioritization. Input: business processes, key performance indicators, IT inventory. Output: scope, resource business impact, component effect on business processes.

2. Vulnerability / risk assessment. Input: list of vulnerabilities. Output: defined vulnerabilities.

3. Define desired HA/BC targets based on scope. Input: existing BC capability, KPIs, targets, and success rate. Output: defined BC baseline targets, architecture, decision and success criteria.

4. Solution design and evaluation. Input: technologies and solution options. Output: business process segments and solutions.

5. Recommend solutions and products. Input: generic solutions that meet criteria. Output: recommended IBM solutions and benefits.

6. Recommend strategy and roadmap. Input: budget, major project milestones, resource availability, business process priority. Output: baseline business continuity strategy, roadmap, benefits, challenges, financial implications and justification.

2005 version

Page 25

Streamlined BC Actions (Input → Output)

1. Collect info for prioritization. Input: business processes, key performance indicators, IT inventory. Output: scope, resource business impact, component effect on business processes.

2. Vulnerability / risk assessment. Input: list of vulnerabilities. Output: defined vulnerabilities.

3. Define desired HA/BC targets based on scope. Input: existing BC capability, KPIs, targets, and success rate. Output: defined BC baseline targets, architecture, decision and success criteria.

4. Solution design and evaluation. Input: technologies and solution options. Output: business process segments and solutions.

5. Recommend solutions and products. Input: generic solutions that meet criteria. Output: recommended IBM solutions and benefits.

6. Recommend strategy and roadmap. Input: budget, major project milestones, resource availability, business process priority. Output: baseline business continuity strategy, roadmap, benefits, challenges, financial implications and justification.

2012 version: additionally, do a basic HA/DR Data Strategy and exploit Workload Type.

Page 26

How do we get there in 2012?

Bottom line #1: have a basic Data Strategy

Bottom line #2: Exploit Workload type

Data Protection Service Management Storage Efficiency

Page 27

i.e. #1: It’s all about the Data

Now, what do I mean by that?

Page 28

What is a basic Data Strategy? Specify data usage over its lifespan.

(Chart: frequency of access and use declining over time; applications create data, information and data management governs it, then information archive / retain / delete)

Page 29

Business processes drive strategies and they are integral to the Continuity of Business Operations. A company cannot be resilient without having strategies for alternate workspace, staff members, call centers and communications channels.

(Chart: the “ideal world” BC planning lifecycle from Page 22, with the Business Prioritization stage highlighted)

Data strategy = collecting information, prioritizing, vulnerability/risk, scope

Source: IBM STG, IBM Global Services

Page 30

Data Strategy: relationship to Business, IT Strategies

(Chart: strategic alignment model. Business Strategy (business scope, distinct competencies, business governance) aligns with IT Strategy (technology scope, system competencies, IT governance); each maps onto organization, infrastructure, processes, skills and tools.)

Data Strategy defined: the Data Strategy sits at the intersection of Business Strategies, IT Strategy, Enterprise IT Architecture, and IT Infrastructure, spanning People, Process, Structure, Data, and Technology.

Page 31

The role of the basic “Data Strategy” for HA / BC purposes

• Define major data types “good enough”– i.e. by major application, by business line….– An ongoing journey

• For each data type:– Usage– Performance and measurement– Security– Availability– Criticality– Organizational role– Who manages– What standards for this data

• What type storage deployed on• What database • What virtualization

• Be pragmatic– Create a basic, “good enough” data strategy for HA/BC purposes

• Acquire tools that help you know your data
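The per-data-type questions above amount to a classification record. A minimal sketch; the field names follow the bullets, and every value below is an invented example, not an IBM schema:

```python
# Illustrative "good enough" data-type record for HA/BC classification.
from dataclasses import dataclass

@dataclass
class DataType:
    name: str            # by major application / business line
    usage: str
    performance: str     # performance and measurement
    security: str
    availability: str    # e.g. target uptime
    criticality: str     # drives the HA/BC tier chosen later
    owner: str           # who manages it
    storage_class: str   # what type of storage it is deployed on
    database: str
    virtualization: str

# Example entry (all values hypothetical)
payroll = DataType("payroll", "batch + OLTP", "moderate IOPS", "confidential",
                   "99.9%", "mission critical", "HR IT", "replicated SAN",
                   "DB2", "virtualized storage pool")
print(payroll.criticality)  # mission critical
```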

(Chart: the Data Strategy positioning model from Page 30)

You have to know your data, and have a basic strategy for it.

Page 32

Here’s the major difference for 2012: there are two major types of workloads, Traditional IT and Internet Scale Workloads.

• HA, Business Continuity, Disaster Recovery characteristics. Traditional IT: HA/DR/BC can be done “agnostic / after the fact” using replication. Internet Scale: HA/DR/BC must be “designed into the software stack from the beginning”.

• Data Strategy. Traditional IT: use traditional tools/concepts to understand / know data; storage/server virtualization and pooling. Internet Scale: proven Open Source toolset to implement failure tolerance and redundancy in the application stack.

• Automation. Traditional IT: end-to-end automation of server / storage virtualization. Internet Scale: end-to-end automation of the application software stack providing failure tolerance.

• Commonality. Both: apply the master vision and lessons learned from internet scale data centers.

Page 33

Choices for high availability and replication architectures

(Chart: Production Site (site load balancer, web server clusters, application / DB server clusters, server clusters, disk) connected to Other Site(s) through a geographic load balancer and workload balancer; replication options between sites: local backup, application or database replication, server replication, storage replication, PIT image / tape backup)

Page 34

Comparing IT BC architectural methods

• Application / database / file system replication / workload balancer (file system, DB, application aware)
– Typically requires the least bandwidth
– May be required if the scale of storage is very large (i.e. internet scale)
– Span of consistency is that application, database, or file system only
– Well understood by database, application, and file system administrators
– Can be a more complex implementation; must be implemented for each application

• Replication – Server (traditional IT)
– Well understood by operating system administrators
– Storage and application independent; uses server cycles
– Span of recovery limited to that server platform

• Replication – Storage (traditional IT) (file system, DB, application agnostic)
– Can provide common recovery across multiple application stacks and multiple server platforms
– Usually requires more bandwidth
– Requires storage replication skill set
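The trade-offs above can be condensed into a first-cut decision function. A hedged sketch only; the thresholds and inputs are invented, and a real choice would weigh skills, consistency span, and cost too:

```python
# First-cut replication-layer chooser distilled from the bullets above.
# All thresholds are illustrative, not design rules.
def replication_layer(scale_pb: float, spans_platforms: bool,
                      bandwidth_constrained: bool) -> str:
    if scale_pb >= 1 or bandwidth_constrained:
        # application/db/file-system replication needs the least bandwidth
        # and may be the only option at very large (internet) scale
        return "application/database"
    if spans_platforms:
        # storage replication gives one consistency span across stacks
        return "storage"
    return "server"

print(replication_layer(0.1, spans_platforms=True, bandwidth_constrained=False))  # storage
```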

(Chart: the replication-options diagram from Page 33, spanning the Production Site and Multiple Site(s))

Page 35

Principles for Internet Scale Workloads

Page 36

Internet Scale Workload Characteristics - 1

• Embarrassingly parallel Internet workload
– Immense data sets, but relatively independent records being processed
• Example: billions of web pages; billions of log / cookie / click entries
– Web requests from different users are essentially independent of each other
• Creating natural units of data partitioning and concurrency
• Lends itself well to cluster-level scheduling / load balancing
– Independence means peak server performance is not important; there is very low inter-process communication
– What’s important is the aggregate throughput of 100,000s of servers

• Workload churn
– Well-defined, stable high-level APIs (i.e. simple URLs)
– Software release cycles on the order of every couple of weeks
• Means Google’s entire core of search services was rewritten in 2 years
– Great for rapid innovation
• Expect significant software re-writes to fix problems on an ongoing basis
– New products emerge hyper-frequently, often with workload-altering characteristics (example: YouTube)
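"Embarrassingly parallel" can be shown in miniature: independent records mean the work can be split across workers with no inter-process communication, so any ordering gives the same answer. A toy sketch (the log entries and scoring function are invented):

```python
# Independent records: per-record work needs no coordination, so aggregate
# throughput scales with worker count and result order doesn't matter.
from concurrent.futures import ThreadPoolExecutor

def score(entry: str) -> int:
    # stand-in for per-record work (e.g. parsing a log/click/cookie entry)
    return len(entry)

entries = [f"user{i} clicked page{i % 7}" for i in range(1000)]

with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(score, entries))

# No cross-record dependencies: parallel and sequential answers agree.
sequential = sum(map(score, entries))
assert total == sequential
print(total)
```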

Page 37

Internet Scale Workload Characteristics - 2

• Platform homogeneity
– A single company owns, has the technical capability for, and runs the entire platform end-to-end, including an ecosystem
– Most web applications are more homogeneous than traditional IT, with an immense number of independent worldwide users

• 1% - 2% of all Internet requests fail*; users can’t tell the difference between the Internet being down and your system being down. Hence 99% is good enough.

*The Data Center as a Computer: Introduction to Warehouse Scale Computing, p. 81, Barroso, Holzle: http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006

• Fault-free operation via application middleware
– Some type of failure every few hours, including software bugs
– All hidden from users by fault-tolerant middleware
– Means hardware and software don’t have to be perfect

• Immense scale
– Workload can’t be held within 1 server, or within a maximum-size tightly-clustered memory-shared SMP
– Requires clusters of 1000s or 10,000s of servers, with corresponding PBs of storage, network, power, cooling, software
– Scale of compute power also makes possible apps such as Google Maps, Google Translate, Amazon Web Services EC2, Facebook, etc.
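The arithmetic behind "something is always broken, hide it in software" is worth a line of code: replicate a modestly available server and ensemble availability climbs fast. A small sketch (the 99% figure is an illustrative input, echoing the callout above):

```python
# Availability of n independent replicas in parallel:
# the ensemble is down only if every replica is down simultaneously.
def parallel_availability(a_single: float, n: int) -> float:
    """P(at least one of n independent replicas is up)."""
    return 1 - (1 - a_single) ** n

print(round(parallel_availability(0.99, 1), 6))  # 0.99
print(round(parallel_availability(0.99, 3), 6))  # 0.999999
```

This is why imperfect commodity hardware plus redundancy-aware middleware can still present a service that looks continuously up.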

Page 38

IT architecture at internet scale

• Internet scale architectures’ fundamental assumptions:
– Distributed aggregation of data
– High availability and failure-tolerance functionality is in software on the server
– Time to market is everything; breakage is “OK” if I can insulate it from the user
– Affordability is everything; use open source software wherever possible
– Expect that something somewhere in the infrastructure will always be broken
– Infrastructure is designed top-to-bottom to address this

• All other criteria are driven off of these

Criteria: cost, plus extreme scale, parallelism, performance, real time, and time to market

Page 39

For Internet Scale workloads, an Open Source based internet-scale software stack. Example shown is the 2003-2008 Google version:

(Chart: Google stack: server hardware, RHEL 2.6.x PAE, rack, interior network (IPv6), exterior network, data center; GFS / GFS II, BigTable, MapReduce, Chubby lock, GWQ; Google App Engine (Python, Java, C++, Sawzall, other); Google apps: search, index, crawl, Gmail, ...)

1. Google File System Architecture – GFS II
2. Google Database – Bigtable
3. Google Computation – MapReduce
4. Google Scheduling – GWQ

The OS or HW doesn’t do any of the redundancy; reliability and redundancy are all in the “application stack”.
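MapReduce, the computation layer named above, is easy to show in miniature. This is a single-process teaching sketch of the model (map, shuffle/group, reduce), not Google's implementation, and the word-count job is an invented example:

```python
# MapReduce in miniature: map emits (key, value) pairs, the shuffle groups
# them by key, and reduce aggregates each group independently.
from collections import defaultdict

def map_phase(doc: str):
    for word in doc.split():          # map: emit (word, 1) per occurrence
        yield word.lower(), 1

def reduce_phase(pairs):
    groups = defaultdict(int)         # shuffle: group values by key
    for key, value in pairs:
        groups[key] += value          # reduce: sum each key's values
    return dict(groups)

docs = ["the web is big", "the web is parallel"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = reduce_phase(pairs)
print(counts["the"])  # 2
```

Because every map call and every per-key reduction is independent, the real system can fan the same logic out across thousands of servers and re-run any piece that fails.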

Page 40

Internet-scale IT infrastructure

(Chart: input from the Internet, your customers, flowing through HA/DR/BC for Internet Scale Workloads)

Each red block is an inexpensive server with plenty of power for its portion of the workflow.

Page 41

Warehouse Scale Computer programmer productivity framework example

• Hadoop – overall name of the software stack
• HDFS – Hadoop Distributed File System
• MapReduce – software compute framework: Map = queries; Reduce = aggregates answers
• Hive – Hadoop-based data warehouse
• Pig – Hadoop-based language
• HBase – non-relational database for fast lookups
• Flume – populates Hadoop with data
• Oozie – workflow processing system
• Whirr – libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.
• Avro – data serialization
• Mahout – data mining
• Sqoop – connectivity to non-Hadoop data stores
• BigTop – packaging / interop of all Hadoop components

http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

Page 42

Summary: two major types of approaches, depending on workload type, Traditional IT vs. Internet Scale Workloads.

• HA, Business Continuity, Disaster Recovery characteristics. Traditional IT: HA/DR/BC can be done “agnostic / after the fact” using replication. Internet Scale: HA/DR/BC must be “designed into the software stack from the beginning”.

• Data Strategy. Traditional IT: use traditional tools/concepts to understand / know data; storage/server virtualization and pooling. Internet Scale: proven Open Source toolset to implement failure tolerance and redundancy in the application stack.

• Automation. Traditional IT: end-to-end automation of server / storage virtualization. Internet Scale: end-to-end automation of the application software stack providing failure tolerance.

• Commonality. Both: apply the master vision and lessons learned from internet scale data centers.

Page 43

Principles for Architecting IT HA / DR / Business Continuity

Page 44

Key strategy: segment data into logical storage pools by appropriate Data Protection characteristics (animated chart)

• Continuous Availability (CA): end-to-end automation enhances RDR
– RTO = near continuous; RPO = as small as possible (Tier 7)
– Priority = uptime, with high-value justification

• Rapid Data Recovery (RDR): enhance backup/restore
– For data that requires it
– RTO = minutes to (approximate range) 2 to 6 hours
– BC Tiers 6, 4
– Balanced priorities = uptime and cost/value

• Backup/Restore (B/R): assure an efficient foundation
– Standardize the base backup/restore foundation
– Provide universal 24-hour to 12-hour (approx.) recovery capability
– Address requirements for archival, compliance, green energy
– Priority = cost

(Chart axis: Mission Critical at the top, lower cost toward the bottom; enabled by virtualization)

Know and categorize your data: it provides the foundation for affordable data protection.
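The segmentation above amounts to routing each data class to the cheapest pool that still meets its RTO. A minimal sketch; the hour thresholds follow the ranges on this chart loosely, and the catalog entries are invented:

```python
# Route each data class to the cheapest protection pool meeting its RTO.
# Thresholds are illustrative readings of the chart, not design rules.
def protection_pool(rto_hours: float) -> str:
    if rto_hours < 0.5:
        return "Continuous Availability (Tier 7)"
    if rto_hours <= 6:
        return "Rapid Data Recovery (Tiers 6, 4)"
    return "Backup/Restore (12-24 hour foundation)"

# Hypothetical data catalog: name -> required RTO in hours
catalog = {"order-entry db": 0.1, "email": 4, "archive scans": 48}
pools = {name: protection_pool(rto) for name, rto in catalog.items()}
print(pools["order-entry db"])  # Continuous Availability (Tier 7)
```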

Page 45

Virtualization is fundamental to addressing today’s IT diversity

Virtualization

Page 46

Virtualized IT infrastructure and Business Processes

Virtualized systems become the resource pools that enable recoverability: consolidated virtualized systems become the Recoverable Units for IT Business Continuity

Virtualization

Page 47

High Availability, Business Continuity: a step-by-step virtualization journey, balancing recovery time objective with cost / value

Recovery Time Objective: 15 min, 1-4 hr, 4-8 hr, 8-12 hr, 12-16 hr, 24 hr, days; cost / value rises as RTO shrinks. Recovery from a disk image vs. recovery from tape copy.

BC Tier 1 – Restore from Tape
BC Tier 2 – Tape libraries + Automation
BC Tier 3 – VTL, Data De-Dup, Remote vault
BC Tier 4 – Add Point in Time replication to Backup/Restore
BC Tier 5 – Add Application/database integration to Backup/Restore
BC Tier 6 – Add real-time continuous data replication, server or storage
BC Tier 7 – Add Server or Storage replication with end-to-end automated server recovery

Foundation: storage pools

Page 48

Storage PoolsApply appropriate server,

storage technology

Real Time replication(storage or server or

software)

Real Time replication(storage or server or

software)

Periodic PiT replication:-File System

- Point in Time Disk- VTL to VTL with Dedup

Periodic PiT replication:-File System

- Point in Time Disk- VTL to VTL with Dedup

- Foundation backup/restore- Physical or electronic transport

- Foundation backup/restore- Physical or electronic transport

PetaByteUnstructured

PetaByteUnstructured

PetabyteUnstructured

PetabyteUnstructured

Petabyte unstructured, due to usage and large scale, typically uses

application level intelligent redundancyfailure toleration design

Petabyte unstructured, due to usage and large scale, typically uses

application level intelligent redundancyfailure toleration design

Real-time replication

Point in time

Removable media

File, application, or disk-to-disk

periodic replication

Add automated failover to replicated storage
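One way to read the storage-pool picture is as a routing decision: each workload class maps to one protection technique. The helper below sketches that mapping; the workload class names and the 1 PB threshold are hypothetical, chosen only to illustrate the decision:

```python
def protection_strategy(workload: str, size_pb: float) -> str:
    """Pick a data-protection approach per the storage-pool view:
    petabyte-scale unstructured data relies on application-level
    redundancy; other pools use storage-level techniques.
    Class names and the 1 PB threshold are illustrative assumptions."""
    if workload == "unstructured" and size_pb >= 1.0:
        return "application-level intelligent redundancy / failure toleration"
    strategies = {
        "critical": "real-time replication (storage, server, or software)",
        "important": "periodic PiT replication (file system, PiT disk, VTL-to-VTL dedup)",
        "standard": "foundation backup/restore, physical or electronic transport",
    }
    # Unknown classes fall back to the foundation technique.
    return strategies.get(workload, strategies["standard"])
```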

Page 49

Methodology: Traditional IT HA / BC / DR in stages, from the bottom up

Horizontal axis: Recovery Time Objective. Vertical axis: Cost.

• Foundation: standardized, automated tape backup (Tiers 1, 2)
• Foundation: electronic vaulting, automation, tape library (Tier 3)
• Add: point-in-time copy, disk to disk, tiered storage (Tier 4)

Building blocks: SAN; disk; VTL/de-dup.

• IBM FlashCopy, SnapShot
• IBM XIV, SVC, DS, SONAS
• IBM Tivoli Storage Productivity Center 5.1
• IBM ProtecTier
• IBM Virtual Tape Library
• IBM Tivoli Storage Manager backup/restore
• VTL, de-dup, remote replication at the tape level
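Behind the VTL/de-dup bullet is a simple idea: store each unique data chunk once and let backups reference chunks by content hash. The toy class below illustrates that concept only; it is not how ProtecTier or any IBM product is implemented:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store illustrating block-level
    de-duplication: identical chunks are kept once and referenced
    by their SHA-256 hash. Concept sketch only."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # hash -> chunk bytes, stored exactly once
        self.backups = {}   # backup name -> ordered list of chunk hashes

    def backup(self, name: str, data: bytes) -> None:
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # duplicates dedup here
            refs.append(h)
        self.backups[name] = refs

    def restore(self, name: str) -> bytes:
        return b"".join(self.chunks[h] for h in self.backups[name])
```

Two 8 KB backups sharing a 4 KB run of identical data end up storing that run once, which is the space saving de-dup trades against hash bookkeeping.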

Page 50

Methodology: Traditional IT HA / BC / DR in stages, from the bottom up

Horizontal axis: Recovery Time Objective. Vertical axis: Cost.

• Foundation: standardized, automated tape backup (Tiers 1, 2)
• Foundation: electronic vaulting, automation, tape library (Tier 3)
• Add: point-in-time copy, disk to disk for backup/restore (Tier 4)
• Automate applications and databases for replication and automation (Tier 5)
• Consolidate and implement real-time data availability (Tier 6)
• End-to-end automated site failover: servers, storage, applications (Tier 7)

Building blocks: SAN; disk; VTL/de-dup; application integration; data replication; dynamic end-to-end automated failover of servers, storage, and applications.

If storage-based replication:
• Metro Mirror, Global Mirror, Hitachi UR
• XIV, SVC, DS, other storage
• TPC 5.1
Also:
• VMware
• PowerHA on p
• Tivoli FlashCopy Manager
• Server virtualization
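Tier 7's end-to-end automation is essentially an ordered runbook: quiesce the primary, promote replicated storage, boot standby servers, start applications, redirect clients. The schematic orchestrator below uses placeholder actions standing in for real product calls; the step names paraphrase the slide and nothing here is an actual product API:

```python
# Hypothetical sketch of a Tier 7 end-to-end site failover runbook.
def failover(site, steps=None):
    """Run the failover steps in order, stopping at the first failure
    so operators know exactly where the runbook stalled."""
    steps = steps or [
        ("freeze-primary-io", lambda s: True),        # quiesce writes if primary is alive
        ("promote-replica-storage", lambda s: True),  # make DR copies writable
        ("boot-standby-servers", lambda s: True),
        ("start-applications", lambda s: True),
        ("redirect-clients", lambda s: True),         # DNS / network cutover
    ]
    completed = []
    for name, action in steps:
        if not action(site):
            return completed, name  # report the failed step
        completed.append(name)
    return completed, None
```

The point of the sketch is the ordering and the stop-on-failure contract, which is what "end-to-end automated" buys over hand-run procedures.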

Page 51

Technology deployments in Cloud

1. Private Cloud: enterprise data center
• Client-managed cloud
• Internal or partner implementation services

2. Managed Private Cloud: enterprise data center, co-lo operated
3. Hosted Private Cloud: co-lo owned and operated
• Consumption models including client-owned and provider-owned assets
• Delivery options including client premise and hosted
• Strategic Outsourcing clients with standardized services

4. Shared Cloud Services: shared by Enterprises A, B, and C
• Standardized, multi-tenant service
• Pay-per-usage model with provider-owned assets

5. Public Cloud Services: Users A through E, pay-per-usage
• Supporting compute-centric workloads
• Finer granularity in multi-tenancy model
• Provider-owned assets
• Compute cloud and persistent storage
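The five deployment models differ mainly in who owns the assets, who operates them, and how tenancy is shared. A small lookup table makes the chart queryable; the attribute strings are shorthand paraphrases of the slide, not a formal taxonomy:

```python
# Attribute strings paraphrase the slide; illustrative only.
CLOUD_MODELS = {
    1: ("Private Cloud", "enterprise data center", "client-managed"),
    2: ("Managed Private Cloud", "co-lo operated", "client premise or hosted"),
    3: ("Hosted Private Cloud", "co-lo owned and operated", "standardized services"),
    4: ("Shared Cloud Services", "multi-tenant", "pay-per-usage, provider-owned"),
    5: ("Public Cloud Services", "fine-grained multi-tenant", "provider-owned"),
}

def models_with(keyword: str):
    """Return the model names whose attributes mention a keyword."""
    return [name for name, *attrs in CLOUD_MODELS.values()
            if any(keyword in a for a in attrs)]
```

Querying for "multi-tenant" picks out models 4 and 5, the same split the slide draws between enterprise-private and shared/public services.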

Page 52

Cloud as remote-site deployment options

- Real-time replication (storage, server, or software)
- Periodic PiT replication: file system, point-in-time disk, VTL to VTL with dedup
- Point-in-time copies: physical or electronic transport
- Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to large scale and usage patterns

Production recovery in Cloud

Page 53

Virtualized storage: data strategy with a remote cloud

- Real-time replication (storage, server, or software)
- Periodic PiT replication: file system, point-in-time disk, VTL to VTL with dedup
- Point-in-time copies: physical or electronic transport
- Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to large scale and usage patterns

Legend: real-time replication; point in time; removable media; disk-to-disk replication; automated failover.

Page 54

Local Cloud deployment from a data standpoint

Petabyte unstructured

Page 55

Cloud provider responsibility for HA and BC

- Real-time replication (storage, server, or software)
- Periodic PiT replication: file system, point-in-time disk, VTL to VTL with dedup
- Point-in-time copies: physical or electronic transport
- Petabyte unstructured: petabyte-level storage typically uses intelligent file or application replication, due to large scale and usage patterns

Your production in Cloud; recovery by the Cloud provider

Page 56

Today's world: High Availability, Business Continuity is a step-by-step data strategy / workload journey.
Balancing recovery time objective with cost / value.

Horizontal axis: Recovery Time Objective: 15 min. | 1-4 hr. | 4-8 hr. | 8-12 hr. | 12-16 hr. | 24 hr. | days.
Vertical axis: Cost / Value.

BC Tier 1 – Restore from tape
BC Tier 2 – Tape libraries + automation
BC Tier 3 – VTL, data de-dup, remote vault
BC Tier 4 – Add point-in-time replication to backup/restore
BC Tier 5 – Add application/database integration to backup/restore
BC Tier 6 – Add real-time continuous data replication, server or storage
BC Tier 7 – Add server or storage replication with end-to-end automated server recovery

Tiers 4-7: recovery from a disk image. Tiers 1-3: recovery from a tape copy.

Workload types | Data strategy | Cloud deployment if needed

Page 57

Step-by-step virtualization, High Availability, Business Continuity data strategy.
Balancing recovery time objective with cost / value.

Horizontal axis: Recovery Time Objective: 15 min. | 1-4 hr. | 4-8 hr. | 8-12 hr. | 12-16 hr. | 24 hr. | days.
Vertical axis: Cost / Value.

BC Tier 1 – Restore from tape
BC Tier 2 – Tape libraries + automation
BC Tier 3 – VTL, data de-dup, remote vault
BC Tier 4 – Add point-in-time replication to backup/restore
BC Tier 5 – Add application/database integration to backup/restore
BC Tier 6 – Add real-time continuous data replication, server or storage
BC Tier 7 – Add server or storage replication with end-to-end automated server recovery

Tiers 4-7: recovery from a disk image. Tiers 1-3: recovery from a tape copy.

Bands: Backup/Restore; Rapid Data Recovery; Continuous Availability.

Workload types | Data strategy | Cloud deployment if needed

Page 58

Summary
• Understand today's best practices for IT High Availability and IT Business Continuity
• What has changed? What is the same?
– Principles for requirements: no change
• Data strategy
– Deployment for true internet-scale workloads: application-level redundancy
• Strategies for:
– Requirements, design, implementation
– In-house vs. outsourcing
• Step-by-step approach
– Automation and virtualization are essential
– Segment workloads: traditional vs. petabyte scale
– Exploit Cloud

Data strategy | Workload types | Cloud deployment options

Page 59