High-Availability DevOps
HA Features for Docker Swarm and GitLab
Deploying and managing a DevOps environment requires
attention to eliminating single points of failure.
Using open-source High-Availability and Desired State
Configuration tools, we address the availability and
maintainability of our overall DevOps environment and of
the resources and services it requires.
Topics to be Covered
• DevOps single points of failure
• Tools and methods to mitigate risk
• Infrastructure as Code
• SRE Error Budgeting
Environment Overview
• Test and Prod Swarms, each 5 nodes
  – Docker CE
  – Ubuntu 18.04
• GitLab CE for CI/CD and Docker Registry
• Apache 2 Load Balancers for Apps
• SaltStack codebase defines infrastructure
Failure Modes
Example DevOps Infrastructure
[Diagram: SaltStack (Infrastructure as Code) orchestrates the codebase, integration and deployment (GitLab) and the infrastructure and services: five-node TEST and PROD Swarms running databases, services, and applications.]
Application Dependencies
[Diagram: a single network load balancer provides ingress to the five Swarm nodes, which depend on infrastructure services (LDAP, SMTP, SQL).]
● RISK: one load balancer for ingress
● RISK: node availability, ingress availability
Deployment Dependencies
[Diagram: pipeline from the repository CI/CD script through the deploy container (GitLab runner) to validation (audit + health + monitoring).]
● RISK: single VM. REMEDIATION: HA deployment or cloud.
● RISK: TEST == PROD? REMEDIATION: same CI code with interpolation.
● RISK: runner availability. REMEDIATION: Pacemaker bundle.
● RISK: did it deploy and stay deployed? REMEDIATION: auditing, healthcheck, monitoring (a healthcheck sketch follows below).
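One concrete form of the healthcheck remediation is a Swarm service healthcheck, so failed tasks are restarted automatically. A minimal sketch using the docker CLI; the service name, port, and health endpoint are hypothetical:

# Restart tasks that fail an HTTP health probe
# (service name and endpoint hypothetical; assumes curl in the image).
docker service update \
  --health-cmd 'curl -fsS http://localhost:8080/healthz || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  app1_web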
Swarm Topology
[Diagram: five Swarm nodes (node1-node5); every node is a manager, one is the elected leader, and runner and ingress roles are present on the nodes.]
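The integration notes on the following slides make every peer node a manager; promotion is a single command from any existing manager (hostnames illustrative):

# Promote the remaining peers so any node can run docker stack deploy.
docker node promote node2 node3 node4 node5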
Swarm Topology Failure Response
● A partition might lead to a leader election.
● The mesh network means any node can provide ingress to a stack's service.
● Swarm will try to maintain the replica requirement.
[Diagram: the five-node Swarm after a partition.]
(our) Swarm Integration 1
● To run docker stack deploy, a GitLab runner (a
container) must be on a manager node; we make all peer
nodes managers and use a Pacemaker bundle to ensure the
container starts.
● A DNS VIP ingress requires network and Docker
reconfiguration and a restart; we have a salt-call script
for this and call it from a Pacemaker alert monitor
(sketched below).
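A minimal sketch of those two pieces with pcs; the resource and alert names, image tag, and script path are assumptions rather than our exact configuration:

# Keep one GitLab runner container running on each of the five manager nodes.
pcs resource bundle create gitlab_runner container docker \
    image=gitlab/gitlab-runner:latest replicas=5 replicas-per-host=1

# Invoke the salt-call reconfiguration script whenever a cluster event fires.
pcs alert create id=swarm_reconfig path=/usr/local/bin/swarm-alert.sh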
(our) Swarm Integration 2
● Although Docker Swarm is supposed to ensure that the
requested number of replicas is running, in practice
there is occasionally a deficit, especially after an event.
● After a cluster event, another salt-call script runs that
simply updates any service not running enough replicas
(a standalone sketch follows below).
● Automated deployment and service updates require
valid registry authorization (we use CI_TOKEN in
deployment with a credential helper).
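A standalone sketch of such a replica-repair script; ours runs through salt-call, so this version only illustrates the idea:

#!/bin/sh
# Force-update any service running fewer replicas than requested.
docker service ls --format '{{.Name}} {{.Replicas}}' |
while read -r name replicas; do
  case "$replicas" in */*) ;; *) continue ;; esac  # skip global services
  running=${replicas%%/*}
  wanted=${replicas#*/}
  wanted=${wanted%% *}                             # drop "(max N per node)"
  if [ "$running" -lt "$wanted" ]; then
    docker service update --force "$name"
  fi
done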
(our) Load Balancer
• Apache2 with mod_proxy
• Location directive to map URI to a service
• One load balancer: maintenance without
downtime is impossible
• One proxy entry: single point of ingress
Application Environment
● Applications behind the LB could be in
container environments, on VMs, or in the
cloud.
● The container environment is Docker Swarm.
● Services are generally provisioned on VMs.
Load Balancer Topologies
[Diagram: a single load balancer proxying directly to each of the five Swarm nodes.]
<Location /app1>
  RedirectMatch "(.*)/app1$" \
    "https://appsdemo.holycross.edu/apps1/$1"
  Require all granted
  ProxyPass https://swarmdemo1.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ...
  ProxyPass https://swarmdemo5.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse https://swarmdemo1.holycross.edu:6549
  ...
  ProxyPassReverse https://swarmdemo5.holycross.edu:6549
  SetEnv proxy-sendchunked 1
</Location>
Load Balancer Clustered Ingress
[Diagram: a single load balancer proxying to a Pacemaker-managed VIP that floats across the five Swarm nodes.]
<Location /app1>
  RedirectMatch "(.*)/app1$" \
    "https://appsdemo.holycross.edu/apps1/$1"
  Require all granted
  ProxyPass https://swarmdemo.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse https://swarmdemo.holycross.edu:6549
  SetEnv proxy-sendchunked 1
</Location>
Clustered Load Balancer
[Diagram: two Pacemaker-clustered load balancers in front of the Pacemaker-managed Swarm ingress VIP on nodes 1-5.]
<Location /app1>
  RedirectMatch "(.*)/app1$" \
    "https://appsdemo.holycross.edu/apps1/$1"
  Require all granted
  ProxyPass https://swarmdemo.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse https://swarmdemo.holycross.edu:6549
  SetEnv proxy-sendchunked 1
</Location>
Dual Ingress
[Diagram: two Pacemaker-clustered load balancers in front of two Swarm ingress VIPs (A and B) on nodes 1-5.]
<Location /app1>
  RedirectMatch "(.*)/app1$" \
    "https://appsdemo.holycross.edu/apps1/$1"
  Require all granted
  ProxyPass https://swarmdemoA.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ProxyPass https://swarmdemoB.holycross.edu:6549 retry=5 \
    acquire=3000 timeout=600 keepalive=On
  ProxyPassReverse https://swarmdemoA.holycross.edu:6549
  ProxyPassReverse https://swarmdemoB.holycross.edu:6549
  SetEnv proxy-sendchunked 1
</Location>
Reducing Risk

                          1 Ingress     HA Ingress    HA Ingress    HA Ingress (2)
                          1 Balancer    1 Balancer    HA Balancer
Single Point of Failure?  YES           YES           NO            NO
Transition Ingress (s)    Intervention  13 sec.       13 sec.       < 13 sec.
Transition Balancer (s)   Intervention  Intervention  1 sec.        < 1 sec.
HA Load Balancer
● Configure two (or more) Apache servers with the
proxy configuration as a Pacemaker cluster with a
VIP (sketched below).
● If a load balancer crashes or needs
maintenance, Pacemaker can move the load-balancer
service to an alternate node, manually
or automatically.
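A minimal pcs sketch of this arrangement; the resource names, address, and configuration path are illustrative:

# Floating VIP shared by the load-balancer pair.
pcs resource create lb_vip ocf:heartbeat:IPaddr2 \
    ip=192.168.1.121 cidr_netmask=32

# Apache itself, colocated with (and started after) the VIP.
pcs resource create lb_httpd ocf:heartbeat:apache \
    configfile=/etc/apache2/apache2.conf
pcs constraint colocation add lb_httpd with lb_vip INFINITY
pcs constraint order lb_vip then lb_httpd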
DevOps Storage Models
• Storage reliability and manageability are
already fairly high because of clustering
and LVM.
• Many storage requirements can be met
using databases, repositories, or tagged
storage.
Storage Failure Modes
• One way to manage larger storage use
by a service is to map it to a Docker
volume through a share/mount.
• This creates an availability issue for the
sharing node, whether from node failure or
a maintenance window.
Tools & Methods
Tools & Methods Overview
• HA Cluster – Pacemaker/Corosync
• Desired State Configuration – SaltStack
• Highly Available Storage – DRBD, S2D
High-Availability Clustering
• The IPaddr2 virtual IP resource is
auto-managed by the cluster.
• Alert event handlers run on nodes before or
after a cluster event and are used to update
configuration.
• A Docker bundle ensures that a GitLab runner
container runs on each node.
Desired State Configuration
• Configuration for Docker, the cluster, alerts,
and the ingress VIP is stored in a YAML pillar
database.
• (push) salt state.apply builds Docker
nodes and configures alerts, the VIP, etc.
• (pull) salt-call state.apply updates the running
configuration of a node's daemon.json (examples below).
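For example (the minion target is illustrative; docker.daemon is the state shown in the fragments on the following slides):

# Push: from the master, converge the Docker nodes.
salt 'swarm*' state.apply docker.daemon

# Pull: a node (e.g., from a Pacemaker alert) refreshes its own daemon.json.
salt-call state.apply docker.daemon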
Redundant Swarm Ingress
pillar YAML configuration for the virtual IP:

swarmtest_vip_cib:
  resource:
    swarmtest_vip:
      resource_type: "ocf:heartbeat:IPaddr2"
      resource_options:
        - 'ip=192.168.1.120'
        - 'cidr_netmask=32'
        - 'iflabel=IP_VIRTUAL'
Docker Node Self-Configuration
● At initial node build, or on an event, SaltStack reads the
configuration in serialized (JSON) form from a Salt
‘pillar’ data set.
● The Salt ‘pillar’ is also dynamically configured with
current network configuration, independently of the
logical configuration of the Swarm.
● Changes to the /etc/docker/daemon.json file trigger
a restart of Docker (e.g., picking up updated network
addresses).
Daemon_JSON Salt pillar fragment

Daemon_JSON:
  {{ grains.get('docker-swarm-name', '') }}:
    hosts:
      - "fd://"
{% for interface, addresses in grains.get('ip4_interfaces', {}).items() %}
{% if interface is not match('docker*') %}
{% for ip in addresses %}
      - "tcp://{{ ip }}:2376"
{# addresses #}
{% endfor %}
{% endif %}
{% endfor %}
    storage-driver: overlay2
Docker.daemon state fragment 1

Daemon_Running:
  service.running:
    - name: docker
    - enable: True
    - restart: True
    - watch:
      - file: /etc/docker/daemon.json
Docker.daemon state fragment 2

Daemon_JSON_{{ pillar_name }}:
  file.serialize:
    - name: /etc/docker/daemon.json
    - dataset_pillar: "{{ pillar_path }}"
    - formatter: json
    - merge_if_exists: True
    - show_changes: True
    - user: root
    - group: root
    - mode: 644
Redundant Services
● load balancers (Apache, NGINX)
● smtp gateway (Postfix, sendmail)
Redundant Filesystems and Shares
● Vendor solutions
○ EMC Isilon (CIFS+NFS)
○ Netapp (CIFS+NFS)
○ Pure Storage (CIFS+NFS)
● Microsoft
○ Azure Stack HCI (CIFS/ReFS)
● Open source
○ DRBD (NFS)
○ CEPH (CIFS, NFS, S3)
Docker Volumes and HA
● Most of our containerized applications either
use a database directory or manage data on a
Docker volume through a repository.
● We have a few static websites that need HA
disks; we map these onto Docker volumes via
network filesystems.
Mapping NFS to Docker
● Let Docker Swarm manage the NFS or CIFS
mount via the compose file.
● An HA disk server keeps the mount available
during unexpected or scheduled downtime.
Compose NFS Mount Definition
volumes:
  - type: volume
    source: web-cgibinintranet
    target: $HC_WEB_CGIBININTRANET_MOUNTPOINT
    volume:
      nocopy: true
Compose NFS Volume Definition
volumes:
  web-cgibinintranet:
    driver_opts:
      type: "nfs"
      o: "nfsvers=4,addr=sanfs1.holycross.edu,ro"
      device: ":/sanfs/cgibinintranet"
Compose CIFS Mount Definition
volumes:
  - type: volume
    source: web-cifs
    target: $HC_ALT_LEGACY_MOUNTPOINT
    volume:
      nocopy: true
Compose CIFS Volume Definition
volumes:
  web-cifs:
    driver_opts:
      type: "cifs"
      o: "username=${USER},password=${PASS},domain=${DOMAIN},iocharset=utf8,uid=${UID},gid=${GID}"
      device: "//${SMB_SERVER}/web/legacy"
Infrastructure as Code
Motivations and Benefits
• Apply DevOps to system administration
  – repository, pipelines, issues, documentation
• Push configuration to build standardized templates,
validate, deploy, and audit.
• Pull configuration for events, triggers, and
self-configuration.
• Extend and reuse code across platforms.
Build and Deployment
• We apply about 333 formulas on a typical Linux deploy
to ensure the desired configuration.
• A Linux template deploys in 5-10 minutes after adding it
to the authentication domain and defining some metadata.
• A Docker node builds and deploys in about 20-30 minutes
from a base template deploy.
• We also build complex clusters and application services
on top of nodes built this way.
Validation and Audit
• We validate and apply over 200 CIS rules on a base
Linux deployment, plus additional CIS rules for Docker,
MySQL, Postgres, and Apache, as well as internally
developed best practices.
• Standardized tags on best-practice rules can be parsed
into JSON for generating compliance reports and
documentation.
Self-Configuration
• Interactively fix a configuration knowing the change is
already documented in code.
• An event or trigger can reconfigure the running system
according to current state rather than the state at build
time.
SRE Error Budgeting
100% Is Infeasible
• As much as we would like 100% uptime, we cannot
possibly guarantee it, and all of our infrastructure
needs occasional maintenance.
• We perform scheduled maintenance, but it is difficult to
schedule and disruptive. DevOps, clustering, and
virtualization have generally increased our ability to
safely perform unscheduled maintenance.
Current Monitoring
• Our current monitoring is cloud-based and simply
measures service availability.
• We need richer indicators with measurable objectives
that lead to defined responses.
Service Level Indicators: SLI
• SLI: service level indicator.
  – A good SLI should be at least a scalar; e.g., instead
of measuring 'uptime', we could measure 'errors per
interval.'
  – Try to standardize common SLIs for reuse.
  – Naturally, some SLIs will be specific to the service.
Service Level Objectives: SLO
• Set internal objectives that will be used to manage
change.
• Once you set the SLO, say, "99.5% of transactions
will have a latency below 500 ms", the error budget is
1 - SLO: here, 1 - 0.995 = 0.5% of transactions may
exceed 500 ms. If more than 0.5% of transactions miss
the 500 ms target, you have exhausted your error budget.
When we exceed our error budget, we change our focus
from new features to stability.
Service Level Agreement: SLA
• The SLA is the agreement you have with the
customer, and it will generally be a looser objective
than the SLO.
• As with the SLO, the SLA needs consequences for
exceeding the error budget; for an internal customer,
perhaps a review.
Tool Versions
• ClusterLabs pacemaker 1.1.18
• RedHat corosync 2.4.3
• Docker CE 19.03.4
• Ubuntu Server 18.04
• Windows Server 2016 Datacenter 1607
• VirtualBox (for demos) 6.0.14
Some Unreviewed Tools
• Load balancing with IPVS/VRRP
  – Keepalived
  – NGINX
  – Traefik
• Storage alternatives
  – S2D Azure Stack HCI
  – CEPH