
Automated Failure Management

PUTTING TECHNOLOGY TO WORK IN BUSINESS CONTINUITY

Indra Mohan, CEO, Enigmatec Corp


© Enigmatec Corporation 2005

Agenda

• Company Introduction
• Business Continuity Planning - The IT Challenge
• Automated Failure Management - Use-Case
• Our Product
• Technology Trends
• Summary


Enigmatec Company Introduction

• Founded in 2001 to commercialise research pioneered at Edinburgh and Cambridge University

• Offices in London, New York & Palo Alto

• Policy-based Automation Product released June 2004

• We automate the Data Center “Run Book”

• Partners include:
  • Intel
  • Sun Microsystems
  • VMware


Business Continuity Planning - The IT Challenge

OVERALL BUSINESS GOALS: LESS DOWNTIME
• Number of Failures
• Recovery Time

EXTERNAL DRIVERS
• Competition
• New Regulations

INTERNAL DRIVERS
• Do More With Less
• Single System View


Business Continuity Planning - The Situation Today

• Data Center Business Continuity Today
  • Hardware is Over-Configured
  • Management Tools Were Built To Monitor, Not RESPOND
  • Too Much Human Intervention

• As A Result, Failure Response is
  • Inconsistent & Error-Prone
  • Too Slow

• Current Solutions Are Inadequate
  • Scripts and Manual Procedures Do Not Scale
  • High-Availability Clustering Is Expensive


Business Continuity Planning - How Technology Will Help

RECENT EVOLUTION OF IT MANAGEMENT SOFTWARE

• PROVISIONING: CONFIGURATION OF APPLICATIONS
• VIRTUALIZATION: SERVER CONSOLIDATION
• POLICY-BASED AUTOMATION: RUNNING OF APPLICATIONS

[Timeline axis: 2001 to 2006]


Automated Failure Management Use Case - Major Investment Bank

Application Environment
• Trading System
• Compute Grid

The Challenge
• Failure Response is Slow and Prone to Error
• Hardware is Over-Configured and Under-Utilized

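TRADING SYSTEM & COMPUTE GRID

[Diagram: two application stacks, each listed from hardware upward, with a SYSTEMS MONITOR alongside each stack]

Stack 1:
• SUN HARDWARE
• SOLARIS O/S
• DATABASE
• APP SERVER
• CLIENT-BUILT COMPUTE GRID

Stack 2:
• INTEL HARDWARE
• LINUX O/S
• DATABASE
• APP SERVER
• CLIENT-BUILT TRADING APPS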


Automated Failure Management Use Case - Data Center Environment

• The Traditional Approach
  • 3 Data Centers
  • 4 Environments (Production, BC, Test & Dev.)
  • Application recovery procedures are manual and script-based

• Goals
  • Repeatable Failure Policies: Process Patterns
  • Reduce Application Recovery Time from 1 hour to 15 minutes
  • Eliminate BC Data Center

[Diagram: three data centers (NY - Production; NJ - Test, Dev.; Delaware - BC), each with Servers, Network, and Disk/SAN. Other labels: Run Books, Scripts, Manual, Ad Hoc; Alerts, Logs, Warnings]


Automated Failure Management Use Case - The Enigmatec Solution

• Enigmatec Solution
  • Enigmatec detects failure in the Production Data Center
  • Repurposes the Test & Dev. Data Center into BC
  • Restarts the application stack
  • Manual procedures automated using extensible policy-driven workflow

[Diagram: two data centers (NY - Production; NJ - Test, Dev.), each with Servers, Network, and Disk/SAN, plus an SLA Monitor]
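The detect, repurpose, and restart sequence on this slide can be sketched in a few lines of Python. This is an illustration only, not Enigmatec's policy language: the `DataCenter` class, the `fail_over` function, and the stack start order in `APP_STACK` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """A site with a current role and a list of running services (hypothetical model)."""
    name: str
    role: str                      # e.g. "production", "test/dev", "bc"
    healthy: bool = True
    running: list = field(default_factory=list)


# Assumed start order for the application stack, bottom layer first.
APP_STACK = ["database", "app server", "trading apps"]


def fail_over(production: DataCenter, standby: DataCenter) -> list:
    """Run the three steps from the slide and return a log of the actions taken."""
    log = []
    if production.healthy:
        return log                               # no failure, nothing to do
    log.append(f"failure detected in {production.name}")

    # Repurpose the Test & Dev. site: drop its current workload, change its role.
    standby.running.clear()
    standby.role = "bc"
    log.append(f"{standby.name} repurposed to bc")

    # Restart the application stack in the assumed dependency order.
    for service in APP_STACK:
        standby.running.append(service)
        log.append(f"started {service} on {standby.name}")
    return log


ny = DataCenter("NY", role="production", healthy=False)
nj = DataCenter("NJ", role="test/dev", running=["test build"])
for line in fail_over(ny, nj):
    print(line)
```

Because the same function runs the same steps every time, the response is repeatable, which is the property the manual run-book procedure lacks.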


Automated Failure Management Use Case - Benefits and Next Steps

• Benefits
  • Reduced Application Recovery Time
    • 15 Minutes vs. 1 Hour +
  • Doing More With Less
    • Over $1M in Hardware Savings
    • Over $2M per year OpEx Savings

• Next Steps
  • Roll Out To Additional Business Units
  • Automate Additional IT Procedures:
    • Scale-In / Scale-Out
    • Validation and Testing of Production Systems
  • Single Systems Management View


The Enigmatec Product: Policy-Based Automation - Key Features

• Extensible Policy Execution Language
  • Self-organized deployment of policies across the network
  • Automated service discovery
  • Changes can be deployed “on-the-fly”

• No Single Point of Failure
  • Distributed Agents
  • Full peer-to-peer architecture
  • No centralized server

• Monitoring and Execution of Policies to SLAs
  • Simultaneously platform and application aware
  • Full service-oriented architecture


How We Do It - Core Concepts

[Diagram: IT ELEMENTS, WORKFLOWS, and POLICIES, joined by "+" and "="]

Elements: “What” to Automate
• All Leading Platforms
• Pre-Built Interfaces

Workflows: “How” to Automate
• Replace Scripts and Manuals
• Device Specific

Policies: “When, Where, and Why”
• Easy to Design
• Easy to Manage
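One way to picture how the three concepts fit together is a small data model. The sketch below is an assumption for illustration, not the product's actual policy language: `Element`, `Workflow`, `Policy`, and the `stop` and `start` steps are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """'What' to automate: a managed IT resource (hypothetical fields)."""
    name: str
    kind: str
    up: bool = True


# 'How' to automate: an ordered list of steps applied to an element.
Workflow = List[Callable[[Element], str]]


@dataclass
class Policy:
    """'When, where, and why': a condition that triggers a workflow on an element."""
    reason: str                              # why
    applies_to: str                          # where: the element kind it covers
    condition: Callable[[Element], bool]     # when
    workflow: Workflow

    def evaluate(self, element: Element) -> List[str]:
        """Run the workflow if the policy applies; return the steps performed."""
        if element.kind != self.applies_to or not self.condition(element):
            return []
        return [step(element) for step in self.workflow]


def stop(e: Element) -> str:
    return f"stop {e.name}"


def start(e: Element) -> str:
    e.up = True
    return f"start {e.name}"


restart_on_failure = Policy(
    reason="keep app servers within their recovery objective",
    applies_to="app server",
    condition=lambda e: not e.up,
    workflow=[stop, start],
)

server = Element("app01", kind="app server", up=False)
print(restart_on_failure.evaluate(server))
```

Keeping the workflow separate from the policy means the same restart steps can be reused under different conditions, which is one reading of the "+" in the diagram.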


How We Do It - A Distributed Service Grid

1. Design Policies & Set SLA/RTOs
2. Test/Verify
3. Deploy to Data Center
4. Agents Monitor for Failure and Execute Policies
5. Agents Report SLA and RTO Behavior to Console

[Diagram: Design Repository; ENIGMATEC SERVICE GRID; PHYSICAL RESOURCES (Web Farms, Clusters, App Servers, Blades, Racks, Network, Disk/SAN)]
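Steps 4 and 5 can be illustrated with a minimal agent loop: check a service, run a recovery action when the check fails, and report the measured recovery time against the RTO (recovery time objective). The `Agent` and `RecoveryReport` classes and the injected `clock` are assumptions made for this sketch. A real agent would read a monotonic clock such as Python's `time.monotonic`, while the example uses a simulated clock so the result is deterministic.

```python
from dataclasses import dataclass


@dataclass
class RecoveryReport:
    """What an agent would send to the console after a recovery (hypothetical)."""
    service: str
    recovery_seconds: float
    rto_seconds: float

    @property
    def within_rto(self) -> bool:
        return self.recovery_seconds <= self.rto_seconds


class Agent:
    """Hypothetical agent: checks a service, runs a recovery action, records timing."""

    def __init__(self, service, check, recover, rto_seconds, clock):
        self.service = service
        self.check = check              # returns True when the service is healthy
        self.recover = recover          # recovery action, i.e. the policy's workflow
        self.rto_seconds = rto_seconds
        self.clock = clock              # injectable time source

    def poll(self):
        """One monitoring cycle; returns a report only if a recovery ran."""
        if self.check():
            return None
        started = self.clock()
        self.recover()
        elapsed = self.clock() - started
        return RecoveryReport(self.service, elapsed, self.rto_seconds)


state = {"healthy": False}
ticks = iter([0.0, 600.0])              # simulated clock: recovery takes 10 minutes

agent = Agent(
    service="trading system",
    check=lambda: state["healthy"],
    recover=lambda: state.update(healthy=True),
    rto_seconds=15 * 60,                # the 15-minute goal from the use case
    clock=lambda: next(ticks),
)
report = agent.poll()
print(report.within_rto)
```

Reporting the measured time next to the objective is what lets a console show SLA and RTO behavior and not just raw alerts.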


Data Center Macro Trends - Where We Are Going

• Shift from Dedicated to Shared Compute
  • Similar to Evolution of Storage Area Networks
  • Dynamic Re-Allocation of Resources

• Modular Data Centers
  • SLA-Based Management
  • Utility “Dial-Tone” Computing

…and Automated Failure Management is the First Step


Thank You!