
MIGRATORY WORKLOADS ACROSS CLOUDS WITH NOMAD

Phil Watts DevOps Artificer

PROBLEM STATEMENT

“FLEXING BETWEEN THE CLOUDS”

▸ Goals of Virtualization seem universally applicable

▸ !(Vendor Lock-in)

▸ Not all workloads are valued equally


IT Magic Anywhere

SUCCESS CRITERIA

WIN CONDITIONS

‣ Availability of compute resources is independent of the cloud provider

‣ Batch jobs can be allocated based on point-in-time cost metrics

‣ Work can be segregated based on compliance qualifications

TOOLCHAIN

MY CURRENT “FAVORITE” TOYS

Resources

Image Creation

Infrastructure Provisioning

Service Discovery

Scheduler

Driver

DEFINITIONS: RESOURCE CONTEXT

THE BANE OF TECHNICAL UNDERSTANDING (AKA WORDS):

▸ Region: The isolation boundary of a Nomad Cluster

▸ Datacenter: Low latency, high bandwidth, private network

▸ Resources: The available capacity provided by a node

Common / Comfortable Pattern

Platform   Region        Datacenter
AWS        Continental   AWS_Region
GCE        Continental   GCE_Region
Azure      Location      Location

Ideal Pattern

Platform   Region   Datacenter
AWS        Global   AWS_Region
GCE        Global   GCE_Region
Azure      Global   Sets of Locations

NOMAD ARCHITECTURE - SINGLE REGION VIEW

BDFL FOR WORKLOAD DECISIONS

‣ In Nomad, Datacenters can speak to region-aware Servers

‣ Datacenters don’t need to be the same platform

‣ Default Region is “global”
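As a rough sketch of the points above, a client agent joins a region by declaring `region` and `datacenter` in its agent configuration. Clients on another platform would share the same `region` and differ only in `datacenter`. The datacenter name, data directory, and server address below are hypothetical placeholders, not values from the talk.

```hcl
# client-aws.hcl: agent configuration for a Nomad client on an AWS instance.
# All names and addresses here are illustrative assumptions.
region     = "global"          # the default region, shared by every datacenter
datacenter = "aws-us-east-1"   # hypothetical datacenter name
data_dir   = "/var/lib/nomad"  # hypothetical state directory

client {
  enabled = true

  # Hypothetical address of a region-aware Nomad server (RPC port 4647)
  servers = ["nomad-server.example.internal:4647"]
}
```

A client on GCE could reuse this file with only `datacenter` changed, for example to a name such as `gce-us-central1`.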

ARCHITECTURE OF SOLUTION

▸ Nomad Clients potentially provide Resources for Jobs

▸ Communication between Datacenters may need to be secured

▸ Nodes run a Consul Agent and Nomad Client

▸ Nomad Servers “Bin Pack” tasks onto nodes

THREE PICTURES OF THE SAME THING

Single Region / Multi Datacenter (different Clouds)
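From the operator's side, the single-region, multi-datacenter layout might look like the job file below: one region, with a list of datacenters that live in different clouds. This is a minimal sketch; the datacenter names, task name, and image are assumptions chosen for illustration.

```hcl
# Sketch: one job that is eligible to run in either cloud.
job "sample_service" {
  region      = "global"
  datacenters = ["aws-us-east-1", "gce-us-central1"]  # hypothetical names

  group "webservice" {
    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.11"  # illustrative image
      }
    }
  }
}
```

The scheduler is then free to place the group on any node in either datacenter that satisfies the job's constraints.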

DEFINITIONS: TASK CONTEXT

WORDS: THE SEQUEL

▸ Task: Desired state declaration of workload

▸ Constraints: Rules limiting where a job can run

▸ Evaluations: Queued request to compare desired and present state of work over the region

  ▸ Caused by a state change event

    ▸ Job Completion

    ▸ Node Addition/Subtraction

    ▸ Job Scheduled

▸ Allocations: Mapping of tasks to resources within constraints

JOB TYPES: SERVICE

KEEPING THE SITE UP

▸ Long running jobs that should always be available

▸ Scheduling decisions favor QoS

▸ Example: Ensuring a front end web service is always available
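A service job for the front end example could be sketched as follows. The job, group, and task names, the datacenter name, and the restart values are assumptions for illustration only.

```hcl
# Sketch of a long-running service job (all names are hypothetical).
job "frontend" {
  datacenters = ["aws-us-east-1"]
  type        = "service"  # also the default when type is omitted

  group "web" {
    count = 3  # keep three instances of the group running

    # Restart failed tasks a few times before giving up on this node
    restart {
      attempts = 3
      interval = "5m"
      delay    = "15s"
      mode     = "delay"
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.11"  # illustrative image
      }
    }
  }
}
```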

JOB TYPES: BATCH

WHAT TO DO WITH ALL THIS DATA?

▸ A set of work spanning a few minutes to a few days

▸ Based on the Berkeley Sparrow “power of two choices” model

  ▸ http://people.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf

▸ Probes a set of nodes which meet constraints and sends work to the "least loaded" nodes

▸ Example: Tasks to manipulate a queue of data when present
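The queue-processing example might be declared along these lines. This is a sketch under stated assumptions: the binary path, its flag, and the datacenter names are hypothetical.

```hcl
# Sketch of a batch job that drains a queue and then exits.
job "queue_worker" {
  datacenters = ["aws-us-east-1", "gce-us-central1"]  # hypothetical names
  type        = "batch"

  group "workers" {
    count = 5  # run five workers in parallel

    task "process" {
      driver = "exec"

      config {
        command = "/usr/local/bin/process-queue"  # hypothetical binary
        args    = ["--drain"]                     # hypothetical flag
      }

      resources {
        cpu    = 250  # MHz
        memory = 128  # MB
      }
    }
  }
}
```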

JOB TYPES: SYSTEM

KEEPING THE LIGHTS ON

▸ A unique job type used to declare jobs which should run on every node which meets the job constraints

▸ Are re-evaluated whenever a node joins the cluster

▸ Example: distributing common tasks, which can benefit from rolling updates, job updates, service discovery
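A system job that places one copy of a common task on every matching node could be sketched like this. The job name, datacenter names, and image are assumptions; `${attr.kernel.name}` is a node attribute that Nomad clients fingerprint.

```hcl
# Sketch of a system job: one allocation per node that meets the constraint.
job "log_shipper" {
  datacenters = ["aws-us-east-1", "gce-us-central1"]  # hypothetical names
  type        = "system"

  # Only nodes running a Linux kernel are eligible
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "shipper" {
    task "ship" {
      driver = "docker"

      config {
        image = "example/log-shipper:1.0"  # hypothetical image
      }
    }
  }
}
```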

NOMAD SCHEDULING INTERNALS

GETTING FROM WORK AND RESOURCES TO ACCOMPLISHMENTS

▸ Evaluations read the Job Specification and find constraints

▸ Evaluation Brokers maintain the pending queue, priority, and at-least-once delivery

▸ Schedulers submit an Allocation Plan, evaluated for feasibility, followed by priority

▸ Allocations set jobs against resources

LIKE TETRIS FOR WORKLOADS

▸ Tasks require resources

▸ Nodes have “dimensions” of resources

▸ Allocation fits Tasks inside Nodes

BIN PACKING
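The "dimensions" a task asks for are declared in its `resources` block, which is what the scheduler fits against each node's remaining capacity. The fragment below is a sketch of a single task as it would appear inside a group; the task name, image, and values are illustrative assumptions.

```hcl
# Fragment of a job file: a task stating the resources it needs.
task "web" {
  driver = "docker"

  config {
    image = "nginx:1.11"  # illustrative image
  }

  resources {
    cpu    = 500  # MHz of CPU
    memory = 256  # MB of RAM
  }
}
```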

TASK GROUPS

PREVENTING TASK SEPARATION ANXIETY

▸ Task Groups allow multiple Tasks to require that they are scheduled on the same node

▸ Are created implicitly for single tasks in isolation

▸ Can be used to enforce compliance elements required to run together

▸ Example: Requiring log shipping co-processes
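The log shipping example might look like the sketch below: two tasks in one group, so they are always placed together on the same node. All names and images are hypothetical.

```hcl
# Sketch: a web task and its log shipper scheduled as one unit.
job "sample_service" {
  datacenters = ["aws-us-east-1"]  # hypothetical name

  group "webservice" {
    # Both tasks below land on the same node
    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.11"  # illustrative image
      }
    }

    task "log-shipper" {
      driver = "docker"

      config {
        image = "example/log-shipper:1.0"  # hypothetical image
      }
    }
  }
}
```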

CONSTRAINTS

JUST BECAUSE YOU CAN, DOESN’T MEAN YOU SHOULD

▸ Job Constraints limit the resources available for a particular job group

▸ Constraints can map workloads directly to Customized Hardware such as AWS Placement Groups
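One way to steer a job toward specific hardware is to constrain on an attribute the client fingerprints from the platform, such as the AWS instance type. This is a sketch rather than the talk's own configuration: the instance type, datacenter name, and task are assumptions, and it targets an instance type rather than a placement group.

```hcl
# Sketch: restrict a job to nodes of one AWS instance type.
job "sample_service" {
  datacenters = ["aws-us-east-1"]  # hypothetical name

  # A job-level constraint applies to every group in the job
  constraint {
    attribute = "${attr.platform.aws.instance-type}"
    value     = "g2.2xlarge"  # illustrative instance type
  }

  group "webservice" {
    task "web" {
      driver = "docker"

      config {
        image = "nginx:1.11"  # illustrative image
      }
    }
  }
}
```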

CONSTRAINTS AND COMPLIANCE

SATISFYING COMPLIANCE REQUIREMENTS

▸ Constraints on datacenter can be used for Data Isolation inside National Boundaries.

  ▸ Healthcare workload that must stay within the EU

▸ Metadata attributes can allow for custom declarations.

  ▸ E.g. PCI DSS Compliance:

    ▸ Maintain network firewall

    ▸ Run anti-malware/anti-virus protection

    ▸ Monitor and log access

    ▸ Regularly test security systems and procedures.

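job "sample_service" {
  ...
  # Job-level metadata; Nomad passes it to the job's tasks
  meta {
    pci_dss = "true"
  }
  group "webservice" {
    # Only place this group on nodes whose client metadata sets pci_dss
    constraint {
      attribute = "${meta.pci_dss}"
      value     = "true"
    }
  }
}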

Constraint Snippet
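For a constraint on `${meta.pci_dss}` to match, the node itself has to advertise that metadata in its client configuration. A minimal sketch of the client side, assuming the operator has already verified the node against the PCI DSS requirements:

```hcl
# Client agent configuration fragment: advertise a custom metadata attribute.
client {
  enabled = true

  # Custom key/value pairs; jobs can constrain on them as ${meta.<key>}
  meta {
    pci_dss = "true"
  }
}
```

Nodes without this entry are filtered out when the scheduler checks the constraint.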

CONSTRAINTS: SATISFYING SPECIAL NEEDS

DIFFERENT THINGS ARE DIFFERENT

▸ Not all platforms are created equal

▸ Platform attributes for specifying Cloud Platforms

job "sample_service" {
  ...
  constraint {
    attribute = "${attr.platform}"
    value     = "aws"
  }
}

▸ ${attr.platform} = aws may be relevant if you need floating-point (GPU) processing, which AWS offers and GCE doesn’t

RAW EXECS

CHEKHOV’S TASK DRIVER

▸ Unconstrained, Un-isolated, Disabled by Default

“IT SEEMS TO BE A DEEP INSTINCT IN HUMAN BEINGS FOR MAKING EVERYTHING COMPULSORY THAT ISN'T FORBIDDEN”

~Robert A. Heinlein

▸ Runs as the user Nomad is running as

▸ Disabled by default

client {
  options = {
    "driver.raw_exec.enable" = "1"
  }
}
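Once the driver is enabled on a client, a task can opt into it with `driver = "raw_exec"`. The fragment below is a sketch of such a task as it would appear inside a group; the task name and command are illustrative assumptions.

```hcl
# Fragment of a job file: a task run without isolation on the host.
task "host-report" {
  driver = "raw_exec"

  config {
    # Runs directly on the host as the user the Nomad client runs as
    command = "/bin/uname"
    args    = ["-a"]
  }
}
```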

OPERATOR INTERACTION

RELIABLE MAGIC = OPERATIONS

$ nomad run -address=$nomad_server jobfile.nomad

‣ Operators schedule jobs against a server

‣ Nomad figures out how/where/when to run tasks

‣ Complex solution through iteration

Phil Watts DevOps Artificer @ REĀN Cloud

@pwattstbd github.com/marsupermammal

www.reancloud.com

import "os"

func presentation() { os.Exit(0) }