Embracing Failure: Self-Healing, Decentralized Resource Management for Apache CloudStack. John Burwell, Vice President, Software Engineering. [email protected] | @john_burwell

Page 1: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Embracing Failure: Self-Healing, Decentralized Resource Management for Apache CloudStack

John Burwell, Vice President, Software Engineering

[email protected] | @john_burwell

Page 2: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

@shapeblue #ccceu

About Me

VP of Software Engineering @ ShapeBlue

Member, Apache CloudStack PMC (June 2013)

Ran operations and designed automated provisioning for analytic/virtualization clouds

Led architectural design and server-side development of a SaaS physical security platform

Page 3: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

About ShapeBlue

“ShapeBlue are expert builders of public & private clouds. They are the leading global Apache CloudStack integrator & consultancy”

…and we’re hiring!

Page 4: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Bang-ups and Hang-ups Can Happen to You

Derive the normative operation design from failure recovery

Page 5: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

What is a Resource?

Control Plane (Desired State)

Resource (Converges Desired with Actual State)

Devices (Actual State)

Eventually, the desired and actual states will be consistent
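A minimal sketch of the convergence idea on this slide, not code from CloudStack or Hoke. The `Device` and `Resource` classes, the dictionary-based state, and the `max_steps` budget are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for a real device; it holds the actual state."""
    actual: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.actual[key] = value


@dataclass
class Resource:
    """Holds the desired state and converges a device toward it."""
    desired: dict
    device: Device

    def diff(self):
        # Settings whose actual value differs from the desired value.
        return {k: v for k, v in self.desired.items()
                if self.device.actual.get(k) != v}

    def converge_once(self):
        """Apply one outstanding change; return True once the states match."""
        pending = self.diff()
        if not pending:
            return True
        key, value = next(iter(pending.items()))
        self.device.apply(key, value)
        return not self.diff()


def converge(resource, max_steps=10):
    """Repeat convergence steps until consistent or the budget runs out."""
    for _ in range(max_steps):
        if resource.converge_once():
            return True
    return False
```

Each step changes one setting, so a failure between steps leaves the device partially converged and the next pass simply resumes from the remaining difference.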

Page 6: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

CloudStack partitions resources into zones, pods, and clusters

Page 7: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Failure Modes

Resource status information is stale or lost

Resource definitions conflict with device state

Entropy

Page 8: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Page 9: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Consistency

Availability

Partition Tolerance

Pick 2

Page 10: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Orchestration operations are available and eventually consistent

... but device modifications must be consistent.

Page 11: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Page 12: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Orchestration Tier (AP)

Automation Control Tier (CP)

Page 13: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Desired Resource State (AP)

Actual Resource State (CP)

Page 14: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Scheduling (AP)

State Convergence (CP)

Resource Offers

Resource Status

State Transitions

Hoke

Page 15: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Hoke Design Goals

Simple

Self-contained

Locality

Non-persistent

Page 16: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Runtime Resource View

Diagram labels: Resource FSM, Management Process, Device, Queue, State Transition, Monitor Process, ResourceOffer, ResourceStatus
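One way to read the diagram is a per-resource state machine that consumes state transitions from a queue. The sketch below illustrates that reading only; the state names, event names, and the `ResourceFSM` class are hypothetical, since the slides do not define a concrete state set.

```python
from collections import deque

# Hypothetical states and events; the slides do not define a concrete set.
TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "started"): "running",
    ("starting", "error"): "failed",
    ("running", "stop"): "stopped",
    ("running", "error"): "failed",
    ("failed", "start"): "starting",
}


class ResourceFSM:
    """Finite state machine fed by a FIFO queue of transition events."""

    def __init__(self, state="stopped"):
        self.state = state
        self.queue = deque()
        self.history = [state]

    def submit(self, event):
        self.queue.append(event)

    def process(self):
        """Drain the queue in order; return the events that were rejected."""
        rejected = []
        while self.queue:
            event = self.queue.popleft()
            next_state = TRANSITIONS.get((self.state, event))
            if next_state is None:
                # Not a legal transition from the current state.
                rejected.append(event)
                continue
            self.state = next_state
            self.history.append(next_state)
        return rejected
```

Rejecting an illegal event instead of raising keeps the queue draining, which suits a design where stale or duplicate requests are expected.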

Page 17: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Actor Model

An actor represents state and behavior

Actors communicate by message passing; each actor has a dedicated queue or mailbox

Each actor is allocated a lightweight thread, which acts as an implicit lock
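A small illustration of the three points above, assuming a plain Python thread stands in for the lightweight thread an actor runtime would provide. `CounterActor` and its message names are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Owns its state; only the actor's own thread touches `_count`."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()   # dedicated mailbox for this actor
        self._count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Callers never touch state directly; they only enqueue messages."""
        self._mailbox.put((message, reply_to))

    def stop(self):
        self._mailbox.put((self._STOP, None))
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, so no explicit lock is needed.
        while True:
            message, reply_to = self._mailbox.get()
            if message is self._STOP:
                return
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)
```

Because the mailbox serializes every message, concurrent senders cannot interleave updates to the counter.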

Page 18: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Resource Graph

All resources are represented in a directed acyclic graph

The root node of the graph is the region; partitions are organized in the following manner: region -> zone -> pod -> cluster

Each resource is a child of the partition node that owns it
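A minimal sketch of the partition hierarchy, modeled here as a tree (a special case of a directed acyclic graph). The `Node` class and the `host` resource kind are assumptions for illustration.

```python
class Node:
    """A partition or resource in the graph, linked to its owning parent."""

    def __init__(self, name, kind, parent=None):
        self.name = name
        self.kind = kind
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Walk up to the root and render the ownership chain."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " -> ".join(reversed(parts))

    def descendants(self, kind=None):
        """Yield all descendants depth-first, optionally filtered by kind."""
        for child in self.children:
            if kind is None or child.kind == kind:
                yield child
            yield from child.descendants(kind)
```

For example, a host created under `cluster-1` reports the path `region-1 -> zone-1 -> pod-1 -> cluster-1 -> host-1`, and `region.descendants("host")` finds it from the root.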

Page 19: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Inspiration from Omega

Google’s resource scheduler

Transactional shared state model enabling sophisticated, global decision making

Supports both high churn and low churn workloads

Multiple, pluggable schedulers working in parallel
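A rough sketch of shared-state scheduling with optimistic commits, in the spirit of the transactional model named above. It is not Omega's implementation; `CellState`, the per-host version counters, and the most-free-capacity placement rule are assumptions.

```python
class CellState:
    """Shared view of free capacity per host, with a version per host."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)
        self.versions = {host: 0 for host in free_cpus}

    def snapshot(self):
        """Give a scheduler a private copy to plan against."""
        return dict(self.free_cpus), dict(self.versions)

    def commit(self, host, cpus, seen_version):
        """Claim `cpus` on `host` only if it is unchanged since the snapshot."""
        if self.versions[host] != seen_version:
            return False  # conflict: another scheduler committed first
        if self.free_cpus[host] < cpus:
            return False
        self.free_cpus[host] -= cpus
        self.versions[host] += 1
        return True


def schedule(cell, cpus, max_retries=3):
    """Plan on a snapshot, try to commit, and retry on conflict."""
    for _ in range(max_retries):
        free, versions = cell.snapshot()
        candidates = [h for h, c in free.items() if c >= cpus]
        if not candidates:
            return None
        host = max(candidates, key=lambda h: free[h])
        if cell.commit(host, cpus, versions[host]):
            return host
    return None
```

Schedulers never block each other while planning; the cost of a conflict is a retry against a fresh snapshot.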

Page 20: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Inspiration from Mesos

Two-level scheduler

Resource Offers

Pessimistic Locking

Pluggable

Geared towards high churn workloads
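A simplified sketch of two-level scheduling with resource offers. It does not use the Mesos API; `Allocator`, `framework_scheduler`, and the CPU-only resource model are assumptions. Capacity in an open offer is held back from other offers, which is the pessimistic-locking aspect listed above.

```python
class Allocator:
    """First level: decides which free resources are offered."""

    def __init__(self, free_cpus):
        self.free_cpus = dict(free_cpus)   # host -> free CPUs
        self.offered = {}                  # host -> CPUs held by an open offer

    def make_offer(self):
        """Offer all free capacity; it stays locked until the offer resolves."""
        offer = {h: c for h, c in self.free_cpus.items() if c > 0}
        for host, cpus in offer.items():
            self.free_cpus[host] -= cpus
            self.offered[host] = cpus
        return offer

    def resolve(self, offer, accepted):
        """Return whatever the framework did not accept to the free pool."""
        for host, cpus in offer.items():
            used = accepted.get(host, 0)
            if used > cpus:
                raise ValueError("accepted more than was offered")
            self.free_cpus[host] += cpus - used
            del self.offered[host]


def framework_scheduler(offer, cpus_needed):
    """Second level: the framework picks from the offer or declines it."""
    for host, cpus in sorted(offer.items()):
        if cpus >= cpus_needed:
            return {host: cpus_needed}
    return {}
```

While an offer is outstanding, a second call to `make_offer` returns nothing, which is why this model favors short, fast scheduling decisions.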

Page 21: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Hybrid Scheduler

Best-effort shared-state scheduler

Multiple parallel schedulers distributed by partition

Combines allocators and planners

Pluggable

Page 22: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Auto Scaling, Self Healing

Partition controllers spawn system VMs for their child partitions as needed to address scheduler busyness and reliability guarantees

Parent partition controllers monitor the health of their child partition controllers and re-spawn them as necessary
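The monitor-and-re-spawn behavior can be sketched as a simple supervision check. The `PartitionController` class, its `healthy` flag, and the in-memory child table are hypothetical; a real controller would probe a system VM rather than read a boolean.

```python
class PartitionController:
    """Supervises the controllers of its child partitions."""

    def __init__(self, name):
        self.name = name
        self.healthy = True      # stand-in for a real health probe
        self.children = {}
        self.respawns = 0

    def spawn_child(self, name):
        """Create (or replace) the controller for a child partition."""
        child = PartitionController(name)
        self.children[name] = child
        return child

    def check_children(self):
        """Replace each unhealthy child; return the names re-spawned."""
        replaced = []
        for name, child in list(self.children.items()):
            if not child.healthy:
                self.spawn_child(name)
                self.respawns += 1
                replaced.append(name)
        return replaced
```

Each controller only watches its direct children, so a failure is handled by the nearest parent without involving the rest of the hierarchy.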

Page 23: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Next Steps

Evaluate implementing the concepts in the Orleans paper to reduce the number of active actors required

Determine the best approach to causality tracking for state transitions (e.g. version vectors)

Create a library implementing these concepts to demonstrate viability, separate concerns, and enable performance testing
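Version vectors are named above only as one candidate for causality tracking. The sketch below shows how they order two state transitions or flag them as concurrent; the function names and the dictionary representation are assumptions.

```python
def increment(vector, node):
    """Return a copy of `vector` with `node`'s counter advanced by one."""
    updated = dict(vector)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum: the vector that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A `concurrent` result means neither transition saw the other, so the conflict has to be resolved explicitly rather than by picking the latest timestamp.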

Page 25: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

References

Hindman, Benjamin; Konwinski, Andy; et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 2011.

Bernstein, Philip; Bykov, Sergey; et al. Orleans: Distributed Virtual Actors for Programmability and Scalability. 2014.

Page 26: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Questions

Comments

Page 27: Embracing Failure:  Self-healing, Decentralized Resource Management for Apache CloudStack

Thank you