ACHIEVING AVAILABILITY AND RESILIENCY IN OPENSTACK FOR NFV

Stratus Webinar | May 26, 2015

Ali Kafel | Senior Director, Business Development | [email protected] | Twitter: @akafel
Steve Hauser | CTO | [email protected]

Stratus Fault-Tolerant Cloud Infrastructure Software for NFV using OpenStack



Agenda

• NFV Overview
• Defining Availability, Reliability and Resiliency
• Achieving Resiliency in Applications vs. Infrastructure
• Software Defined Availability (SDA)
  - Seamless service continuity, with no code changes required
  - Selectable levels of availability for different control and forwarding applications
  - Increasing traditional 45% utilization toward 80% to 90% utilization

Stratus Technologies: 35 Years of Mission-Critical Computing Leadership

• VOS & Continuum (1980 - Present): proprietary platforms with hardware fault tolerance
• ftServer: hardware fault tolerance on Intel platforms
• everRun Enterprise (2008 - Present): software fault tolerance, 12,000+ installed
• Stratus Cloud Technologies (2015): Software Defined Availability

Network Functions Virtualization: What Exactly Is It?

[Diagram: traditional networks are built from monolithic, vertically integrated elements (RAN, backhaul, GPRS/1X, MSC, HLR, SMSC; CPE, L2/L3 switch, firewall, load balancer, NAT, SBC), each supplied whole by a single vendor (Vendors A through F). NFV "delaminates" these into functions such as EPC, PCEF, Diameter core, MME, OCS/OFCS, HSS, PCRF, and IMS, which run as software on Linux, decoupled from the hardware through virtualization and orchestration on commodity hyper-scale COTS computing.]

(The same diagram, extended: the decoupled functions draw from a liquid pool of dynamically allocated resources, managed through automation and orchestration.)


Network Functions Virtualization with Software Defined Networks

[Diagram: virtualized network functions (EPC, PCRF, HSS, IMS) and virtualized control planes for optical transport, L3 routing, and L2 switching run on commodity hyper-scale COTS computing, alongside commodity high-volume networking and storage. SDN separates control from forwarding; the OSS/BSS (billing, customer care, NOC) is virtualized as well, all coordinated through orchestration.]


Defining Availability, Reliability and Resiliency

Availability
• The percentage of time a system is in an operable state, i.e., able to provide access to information or resources
• Availability = Uptime / Total Time

Reliability
• How long a system performs its intended function
• MTBF = Total Time in Service / Number of Failures

Resiliency
• The ability to recover quickly from failures and return to the original form and state (just before the failure)
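These definitions translate directly into arithmetic. A minimal sketch in Python (the function names and the MTBF/MTTR availability formula are standard reliability engineering, not Stratus-specific):

```python
def availability(uptime: float, total_time: float) -> float:
    """Availability = Uptime / Total Time (fraction of operable time)."""
    return uptime / total_time

def mtbf(total_time_in_service: float, num_failures: int) -> float:
    """Mean Time Between Failures = total time in service / number of failures."""
    return total_time_in_service / num_failures

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Five-nines (99.999%) availability allows only about 5.26 minutes
# of downtime per year:
downtime_minutes_per_year = (1 - 0.99999) * 365.25 * 24 * 60
```

Note how availability alone says nothing about state: a system can hit five nines yet still lose in-flight transactions on every failover, which is exactly the HA vs. stateful FT distinction the next slide draws.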

Defining Availability, Reliability and Resiliency

• Therefore, a Highly Available (HA) system may not be Highly Reliable (HRel) or Highly Resilient (HRes).
• A Fault Tolerant (FT) system is Highly Available, Highly Reliable, and Highly Resilient (state is preserved).

Stateful Fault Tolerance = HA + HRel + HRes. When seconds count, an outage means loss of revenue, reputation, safety, or life: lost transactions, lost customers.

• High Availability: recovery from a failure may take a few seconds, minutes, or hours, and the original state is lost.
• Stateful Fault Tolerance: fault-tolerant systems never stop, and the original state is preserved.


Three Ways to Provide Stateful FT in VNFs

1. In the Hardware (fault-tolerant hardware beneath the operating environment and the applications/VNFs)
   • Pros: transparent, with no code changes; fast, simple deployment; no special application software
   • Cons: very expensive; inefficient utilization; special hardware; rigid

2. In the Applications (each application/VNF modified for fault tolerance)
   • Pros: application-specific state can be customized
   • Cons: every application must be modified; longer time to deploy; complex; rigid

3. In the Software Infrastructure (operating environment with a resilience layer)
   • Pros: transparent, with no code changes; fast, simple deployment; no special application software (deploy any); no special hardware (use commodity); multiple levels of resiliency supported; higher efficiency of redundancy (N+k)
   • Cons: the higher efficiency may not be achievable for very large monolithic applications

But Fault Tolerance is more than just state protection: it is the complete fault-management cycle, with multiple levels of resiliency:

(State Protection) -> Detection -> Localization -> Isolation -> Recovery -> Repair (Restore Redundancy)

We call this Software Defined Availability (SDA). It has four characteristics:

1. Selectable Resiliency for each VNF
2. Seamless Protection for all VNFs
3. Agility with a 3rd-party ecosystem
4. Efficiency of Redundancy

Stratus' Software Defined Availability (SDA) solution provides a highly resilient cloud and NFVI:

1. Seamless Protection for all VNFs: software-defined, transparent service continuity, performed automatically by the infrastructure, without application code changes.
2. Selectable Resiliency for each VNF: deploy each VNF with selectable levels of resiliency, including High Availability and stateful Fault Tolerance (state protection) with geo-redundancy, without application awareness.
3. Agility with a 3rd-party ecosystem and any VNF: protect all VNFs in any KVM/OpenStack environment seamlessly, with no complex code development, testing, or support, enabling an optimal partner ecosystem.
4. Efficiency of Redundancy: unlike traditional approaches to fault tolerance, which limit utilization to below 50%, gain a dramatic increase in efficiency of redundancy, at 80% to 90% utilization.

1 | Selectable Resiliency for Each VNF

Software Defined Availability (SDA) delivers availability as an infrastructure service to virtual and cloud ecosystems: any application (firewall, MME, IMS, web server), with any availability need, with full application transparency.

A monolithic VNF can also be componentized: stateless fast-path forwarding elements (VNF-C forwarding elements) are separated from the stateful control element (VNF-C control element), giving each component the right level of resiliency: FT-protected control elements alongside SR-IOV-enabled, high-performance, low-latency forwarding elements.

2 | Seamless Protection: Statepointing

When needed, application state is protected within VM operation, without application awareness:

• VM instances are paired between hosts in the cloud infrastructure.
• The state of the primary is captured regularly and applied to the secondary standby.
• On a fault in the primary, the secondary takes over from the most recent statepoint, without data loss.
• The infrastructure controls when information (network and storage I/O) is allowed to leave the guest.

[Diagram: the primary host executes guest run epochs N-1, N, N+1, ..., shipping statepoints SP N-1, SP N, ... to the secondary host. After a fault on the primary, the secondary resumes execution from the latest statepoint, and a third host (created after the primary failure) is rebuilt from a guest image plus statepoints SP N+1 through SP N+X.]

Active-Standby Statepoint Processes and the Egress Network Barrier

[Diagram: the active guest VM runs under QEMU while a standby QEMU holds the snapshots. Each statepoint n proceeds through phases P1-P5; Pause, Capture, Resume (PCR) marks the phases during which VM execution is suspended. Egress packets from the guest are enqueued behind an egress network queue barrier, which prevents transmission of the queued packets until that statepoint's barrier is removed; barriers for statepoints n and n-1 can be outstanding at the same time. For simplicity, n-2 interactions are not shown.]
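The statepoint cycle and the egress barrier can be sketched as a toy simulation (a deliberate simplification in plain Python; the real mechanism operates inside QEMU at the hypervisor level, and the class and method names here are illustrative, not Stratus API names):

```python
from collections import deque

class StatepointedVM:
    """Toy model of epoch-based statepointing with an egress network barrier.

    Egress packets produced during epoch n are held behind a barrier and
    released only once statepoint n is applied on the standby, so a failover
    to statepoint n can never contradict what the outside world already saw.
    """

    def __init__(self):
        self.state = {}           # guest state on the active VM
        self.standby_state = {}   # last statepoint applied on the standby
        self.egress = deque()     # packets queued behind the barrier
        self.released = []        # packets actually transmitted

    def run_epoch(self, writes, packets):
        self.state.update(writes)    # guest computes during the epoch
        self.egress.extend(packets)  # output is enqueued, not yet sent

    def statepoint(self):
        # Pause-Capture-Resume: capture a snapshot, apply it to the standby,
        # then remove the barrier for everything queued before the capture.
        self.standby_state = dict(self.state)
        while self.egress:
            self.released.append(self.egress.popleft())

    def failover(self):
        # Primary died mid-epoch: the standby resumes from the last
        # statepoint; unreleased egress from the lost epoch was never
        # visible externally, so no observer sees inconsistent state.
        self.state = dict(self.standby_state)
        self.egress.clear()
```

The key invariant the barrier buys is that the externally visible packet stream always corresponds to a captured statepoint, which is what makes the takeover "without data loss" from the network's point of view.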

3 | Agility with a 3rd-Party Ecosystem and Any VNF

NFV and SDN allow low-cost commodity hardware, but when failures happen, service continuity can be affected: the commodity stack alone does not provide five-nines (99.999%) reliability.

[Diagram: the NFV/SDN stack from the earlier slide (virtualized functions, virtualized OSS/BSS and SDN, orchestration, commodity COTS computing, networking, and storage), shown without a resilience layer.]

We solved this by inserting a virtualized cloud resilience layer for NFV and SDN.

[Diagram: the same NFV/SDN stack, now with the Stratus Automated Virtualized Resilience Layer sitting between the virtualized functions and the commodity COTS computing, networking, and storage.]


4 | Efficiency of Redundancy

Shadow secondary VMs are deployed under anti-affinity rules (on different hosts) and consume far fewer resources than their primaries, yielding high utilization and low additional reserve capacity.

[Diagram: primary VMs A, B, C, D and their shadow secondaries A1, B1, C1, D1 spread across separate hosts.]

But before the details, consider how traditional fault tolerance is achieved: through full hardware redundancy.

• Cloud computing environments often use racks of high-density commodity servers.
• Server workloads that need fault tolerance have typically required redundant hardware to run a second copy in lockstep, which means twice the hardware, arranged rigidly in mated pairs.
• When a failure happens, the backup takes over until the original is replaced, preserving service continuity.
• However, replacing the failed unit can take days and a great deal of human intervention, during which another failure would be disastrous.

But Resource Utilization Is 50% at Best

"Traditional telecom networks operate great at 45% utilization, but as AT&T becomes a software company, a reasonable goal could be 80% to 90% utilization." - John Donovan, Senior EVP, AT&T

The problem: 45% utilization, 55% unutilized backup capacity.

Stratus Resilient Cloud Technology Provides Fully Stateful Fault Tolerance at up to 80% Utilization

• Problem: 45% utilization, 55% unutilized backup capacity.
• Solution (virtualized resilience): 80% utilization, 20% unutilized backup capacity, a 37.5% resource savings.
• Stratus Virtualized Resilience requires much less backup capacity for fully stateful fault tolerance.

• Alternatively, Stratus Virtualized Resilience can provide 77.8% more actively utilized capacity from the same resources (80% utilization instead of 45%).
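As I read the slides, the two headline figures come from simple utilization ratios: the 37.5% savings takes the 1+1 lockstep bound of 50% utilization as its baseline, while the 77.8% capacity gain takes the quoted 45% telco baseline. A quick check of that interpretation:

```python
def hw_savings(old_util: float, new_util: float) -> float:
    """Fractional reduction in hardware needed for the same active workload.

    Hardware for one unit of active work is 1/utilization,
    so the saving is 1 - old/new.
    """
    return 1 - old_util / new_util

def extra_capacity(old_util: float, new_util: float) -> float:
    """Fractional increase in active capacity on the same hardware."""
    return new_util / old_util - 1

# 1+1 lockstep caps utilization at 50%; SDA reaches 80%:
savings = hw_savings(0.50, 0.80)    # 0.375, i.e. "37.5% savings"
# Against the quoted 45% telco baseline, the same hardware carries more work:
gain = extra_capacity(0.45, 0.80)   # ~0.778, i.e. "77.8% more capacity"
```

The two benefits are two views of the same ratio shift, which is why the later summary slide offers them as alternatives or a combination.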

Instead of the traditional 1+1 approach, Stratus Resilient Cloud Technology uses Software Defined Availability (SDA), which increases utilization and the efficiency of resiliency while decreasing cost. It is based on an N+k de-clustered redundancy approach, in which shadow secondary VMs are deployed under anti-affinity rules (on different hosts) and consume far fewer resources than their primaries.

Software Defined Availability Increases Utilization and the Efficiency of Resiliency, and Decreases Cost

[Chart: efficiency of redundancy (from simple to sophisticated) plotted against agility (from monolithic hardware with combined forwarding and control to software-virtualized, de-coupled forwarding and control). Most traditional telco systems sit at 1+1 with monolithic hardware. Intermediate schemes include N+1 and separately protecting control and forwarding (C+C, F+F). Software Defined Availability occupies the high-efficiency, high-agility corner, labeled roughly 1+0.06 redundancy for control elements and F+k (k << F) for SR-IOV forwarding elements.]

Asymmetric StateSync™ Redundancy

Coordinated VM interleave improves performance on high-latency links: the primary carries the compute load and takes StatePoints™, while the secondary's processor activity is only 6%-10% of the primary's, synchronized over the StatePoint™ sync link.

N+k De-Clustered Redundancy

[Diagram, built up server by server: the apps in VMs A, B, C, and D on each of the five servers are backed up (A1, B1, C1, D1) on separate servers, which can be anywhere in the pool; in this example the servers back each other up.]

Primaries, secondaries, and reserve capacity are shown for each server: secondary shadow VMs stand up using reserve capacity, and stand-up can also happen on other machines through lower-priority pre-emption.

Upon node failure, the secondaries are activated with no loss of state.

[Diagram: the server hosting A, B, C, and D fails; shadow VMs A1, B1, C1, and D1 stand up on the surviving servers, using reserve capacity or lower-priority pre-emption.]

One of the "k" reserve servers is then activated, while the failed node is logically removed and recycled into the cloud server resource pool.

Finally, primaries are live-migrated to the new server to rebalance the load, yielding high utilization and low additional reserve capacity (for example, 73% utilization with a 27% resiliency reserve, or 87% utilization with a 13% resiliency reserve).
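The N+k placement and failover sequence above can be sketched in a few lines (a toy model under my own assumptions: round-robin shadow placement, one shadow per app; real placement is driven by the scheduler's anti-affinity rules and capacity):

```python
import itertools

def place_shadows(servers, apps_per_server):
    """Assign each primary VM a shadow on a different host (anti-affinity).

    Shadows of server s's apps are scattered round-robin across the other
    servers, so no single failure takes out a primary and its shadow.
    Returns a map (server, app) -> shadow server.
    """
    placement = {}
    for s in servers:
        others = itertools.cycle([t for t in servers if t != s])
        for app in apps_per_server[s]:
            placement[(s, app)] = next(others)
    return placement

def failover(placement, failed):
    """Which server activates each shadow when `failed` goes down."""
    return {app: shadow
            for (srv, app), shadow in placement.items() if srv == failed}

# Five servers, each hosting apps A-D, as on the slides:
servers = [1, 2, 3, 4, 5]
apps = {s: ["A", "B", "C", "D"] for s in servers}
shadows = place_shadows(servers, apps)
# Anti-affinity holds: no shadow shares a host with its primary.
assert all(shadow != srv for (srv, _), shadow in shadows.items())
```

Losing server 1 then activates its four shadows on four different survivors, which is why a single node failure costs only a sliver of each remaining host's capacity instead of an entire mated pair.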

Stratus Resilient Cloud Technology Dramatically Improves the Efficiency of Redundancy

Moving from 45% utilization (55% unutilized backup capacity) to 80% utilization (20% unutilized backup capacity) yields either benefit, or a combination of the two:

• Up to 37.5% resource savings to provide redundancy
• Up to 77.8% more capacity for protected, redundant workloads


Beyond the Virtualized Resilience Layer, the Resilience Management Layer Enables Automation

[Diagram: OSS/BSS and orchestrator(s) sit above MANO and core OpenStack, reached through the Heat orchestration API. The Resilience Management Layer comprises a discovery-and-tagging tool, an authoring tool/service catalog, VNF service templates, Heat templates, and resiliency workload management. Beneath it, the NFVI compute domain (Linux/KVM+QEMU, OpenStack, OVS plus Availability Services) runs on a standard commodity off-the-shelf (COTS) server platform, with a Linux host OS, the Virtualized Resilience Layer, and a vSwitch. VNFCs are instantiated in the NFVI as VMs running any guest OS and are managed by VNFMs through MANO/VIM; the NFVI domain connects to an SDN controller.]
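The anti-affinity placement that underpins the shadow VMs can be expressed with standard Heat resources (a generic sketch, not Stratus' actual service template; the image and flavor names are placeholders):

```yaml
heat_template_version: 2015-04-30
description: Primary and shadow VNFC kept on different hosts via anti-affinity

resources:
  vnfc_group:
    type: OS::Nova::ServerGroup
    properties:
      name: vnfc-anti-affinity
      policies: [anti-affinity]   # scheduler keeps members on different hosts

  vnfc_primary:
    type: OS::Nova::Server
    properties:
      image: vnfc-guest-image     # placeholder image name
      flavor: m1.medium
      scheduler_hints: {group: {get_resource: vnfc_group}}

  vnfc_secondary:
    type: OS::Nova::Server
    properties:
      image: vnfc-guest-image
      flavor: m1.small            # a shadow can be smaller than its primary
      scheduler_hints: {group: {get_resource: vnfc_group}}
```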

Stratus Cloud Solutions: Two Technologies

Availability Services
• Continuous availability, including stateful fault tolerance
• Based on Linux and KVM technology; available on multiple distributions
• Built on field-proven Stratus everRun technology, with 12,000+ licenses deployed

Workload Services (Resilience Management)
• Deployment of workloads
• Automation of availability events
• Layers between orchestrators and OpenStack distributions

In Summary: the Stratus Cloud Solution for telcos and communications infrastructure offers:

1. Seamless Protection for all VNFs: software-defined, transparent service continuity, performed automatically by the infrastructure, without application code changes.
2. Selectable Resiliency for each VNF: deploy each VNF with selectable levels of resiliency, including High Availability and stateful Fault Tolerance (state protection) with geo-redundancy, without application awareness.
3. Agility with a 3rd-party ecosystem and any VNF: protect all VNFs in any KVM/OpenStack environment seamlessly, with no complex code development, testing, or support.
4. Efficiency of Redundancy: unlike traditional approaches to fault tolerance, which limit utilization to below 50%, gain a dramatic increase in efficiency of redundancy, at 80% to 90% utilization.

Seeing Is Believing: ETSI PoC #35

Availability management with stateful fault tolerance; participating telcos include AT&T, NTT, and iBasis.

Contact us to:
1. See this demo and learn more about seamless, software-based fault tolerance in VNFs and other cloud applications
2. Get a copy of the slides or ask further questions

[email protected] | Twitter: @akafel