
Distributed Health Checking for Compute Node High Availability

ZhengSheng Zhou @ AWcloud, Alex Xu @ Intel, JunWei Liu @ ChinaMobile

Agenda

• Compute Node High Availability
• Distributed Health Checking
• Nova Considerations
• Ceilometer Considerations
• Demo

Compute Node High Availability


Compute Nodes
• Some of Our Enterprise Customers Run Pet Workloads
• Hundreds of Nodes or More per Region
• Need a Scalable HA Solution

Compute Node Failure Modes
• Network Cable Broken / Loose
• Host / Switch Power Failure
• Cannot Connect to Host IPMI
• Kernel Stall / Panic
• Host Disk Failure
• Fan Broken, Overheat

Response: Migrate / Evacuate the Host


Compute Node HA@Home

Done, let's go grab a beer.

    #!/bin/sh
    # Naive "HA at home": every 10 seconds, fence and evacuate any host
    # whose nova-compute service is reported down.
    while true; do
        nova service-list --binary nova-compute | \
            grep down | \
            awk '{print $4}' | \
            while read Host; do
                # echo stands in for the real fence/evacuate commands;
                # the host column position depends on the CLI version.
                echo shutdown ${Host}
                echo evacuate ${Host}
            done
        sleep 10
    done


Deployment Topology

[Topology diagram: 3 controllers, compute nodes co-located with Ceph OSDs, and a PXE & Puppet server, connected by six networks: Public / Floating IP, Tenant, Storage, Management, PXE and IPMI.]

Compute Node HA v1

Workflow
• Collect Health Check Results
• Consult Action Matrix
• Send Email, IPMI Power Off, Evacuate

Problems
• Who Supervises the Supervisor?
  – Monitoring Host or Service Failure
• What if It Loses the IPMI Connection?
  – IPMI Network Failure
  – Compute Node Power Failure
• Scalability
  – 1 Monitoring Service Instance
  – N * Hundreds of Compute Nodes


An Example Action Matrix

Mgmt Network | Storage Network | Tenant Network | Other Checks | Action
Down         | Up              | Up             | ...          | Email
Up           | Down            | Up             | ...          | Fence, Evacuate
Up           | Up              | Down           | ...          | Migrate
Down         | Up              | Down           | ...          | ?
Down         | Down            | Down           | ...          | IPMI? Fence, Evacuate

Distributed Health Checking


Consul
● Service Discovery Tool
  – Service Registry
  – DNS Interface
  – Configuration Template
● Distributed K/V Store
● Node Health Check (see the registration sketch below)
  – Script, HTTP, TTL, Edge Triggered
● Large Number of Nodes
  – Event Broadcast, Filtering and Watch
  – Gossip Implementation from Serf
● Session and Lock
  – Leader Election
● REST API
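
For illustration, a minimal sketch of registering a script check with the local agent; /v1/agent/check/register is Consul's standard agent API, while the check name, script path and interval are assumptions:

    # Register a script health check with the local Consul agent.
    # Check name, script path and interval are illustrative.
    curl -X PUT http://127.0.0.1:8500/v1/agent/check/register -d '{
        "Name": "nova-compute-alive",
        "Script": "/usr/local/bin/check_nova_compute.sh",
        "Interval": "10s"
    }'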

[Architecture diagram: Consul servers on Controller 1, 2 and 3 replicate state via Raft; Consul agents on Compute 1, 2 and 3 each run local health checks (Check 1, Check 2) and exchange gossip with all other nodes.]
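
A minimal sketch of bringing a compute node into such a cluster; the -data-dir, -bind and -join flags are standard Consul CLI, while the addresses are illustrative:

    # Start a Consul agent on a compute node and join it to a server
    # (addresses and data directory are illustrative).
    consul agent -data-dir /var/lib/consul \
        -bind 10.0.0.11 -join 10.0.0.1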

The Gossip Protocol

[Diagram: a message ("Blah") spreads epidemically; each node relays it to a few random peers until all five nodes have seen it.]

Reference: https://www.serfdom.io/docs/internals/gossip.html

Indirect Probe

[Diagram: Node 1 probes Node 2 directly and gets no ACK, so it asks Node 3 and Node 4 to probe Node 2 on its behalf.]

Suspicious Node

[Diagram: after the direct and indirect probes fail, Node 1 gossips "Node 2 is Suspicious" to the other nodes.]

Gossip the Failure Message

[Diagram: when the suspicion is not refuted in time, "Node 2 is Offline" is gossiped to all remaining nodes.]


Compute Node HA v2

Health Check
● Network Connectivity Status Provided by Consul Member Information (see the member-status sketch below)
● Compute Nodes Can Register Health Checks in Consul

Monitoring Service HA
● Leader Election and Release
● Consul Session and Lock

Fence
● IPMI Power Off
● Consul Event Broadcast
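
For example, a node's gossip health on a given network can be read straight from the member list; a minimal sketch (consul members is standard CLI; the column positions are an assumption worth checking against your Consul version):

    # Print nodes whose gossip status is not "alive" on this network.
    # Output columns assumed: Node, Address, Status, ...
    consul members | awk 'NR > 1 && $3 != "alive" {print $1}'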

[Deployment diagram: Controller 1-3 and Compute 1-2 each run three Consul agents (Consul Mgmt, Consul Storage, Consul Tenant), one per network, attached to the Management, Storage and Tenant networks; the IPMI network is separate. A Monitoring Service instance runs on each of the three controllers.]


Monitoring Service HA

Run Multiple Monitoring Service Instances
● The Instance Holding the Leader Role Does the Work

Leader Lock and Release
● Consul Session and Lock (see the sketch below)
● The Leader Checks Consul Member Info in the Management, Storage and Tenant Networks
● If It Loses the Other Controllers in Mgmt/Storage/Tenant, It Releases the Lock
● The Lock Times Out on Host or Service Failure

Fenced Node List
● Avoids Shutting Down the Same Machine in a Loop
● Kept in the Consul K/V Store
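
A minimal sketch of the leader lock built on Consul's session and K/V acquire primitives; /v1/session/create and the ?acquire= flag are standard Consul APIs, while the key name, TTL and ID extraction are illustrative:

    # Create a TTL session, then try to acquire the leader key with it.
    # Key name and TTL are illustrative.
    SESSION=$(curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
        -d '{"Name": "monitoring-leader", "TTL": "15s"}' | \
        sed 's/.*"ID": *"\([^"]*\)".*/\1/')
    LEADER=$(curl -s -X PUT -d "$(hostname)" \
        "http://127.0.0.1:8500/v1/kv/monitoring/leader?acquire=${SESSION}")
    # "true" means this instance holds the lock; it is freed explicitly
    # with ?release= or automatically when the session TTL expires.
    [ "${LEADER}" = "true" ] && echo "this instance is the leader"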

A Complementary Fence Method

[Diagram: the Consul server on Controller 1 broadcasts a "Fence Compute 1" message over gossip; the Consul agent on Compute 1 receives it and shuts the node down.]

Send and Receive Fence Messages via the Consul Event Mechanism

A Compute Node Commits Suicide when It Loses the Fence Channel (see the sketch below)
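
A minimal sketch of this fence channel using Consul's event and watch commands (both standard Consul CLI; the event name, node filter and power-off handler are illustrative):

    # Controller side: broadcast a fence event addressed to one node.
    consul event -name fence -node compute1

    # Compute side: invoke a handler whenever a matching event arrives;
    # powering off directly is an illustrative choice of handler.
    consul watch -type event -name fence /sbin/poweroff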


vs. Pacemaker-remote

Pacemaker
● Cluster Orchestration Tool, Based on Corosync, Supports Fencing
● Resource, Clone, Master-Slave
● Location, Colocation Rules
● Limited Cluster Size
● Corosync RRP Mode: Passive or Active

Pacemaker-remote
● No Limit on Cluster Size
● One Heartbeat Network
  – Usually the Management Network
● Fences the Node when the Heartbeat Is Lost!!!
● ocf:pacemaker:pingd
  – Ping Who?

[Diagram: Controller 1-3 each run the Pacemaker rule engine, coordinated by Corosync consensus; Compute 1 through Compute N run pacemaker-remote.]


vs. Zookeeper

Zookeeper
● Coordinates Distributed Applications
● Service Discovery
● Hierarchical K/V Store
● Distributed Lock
● Ephemeral ZNodes
● Nova Service Group API Integration

Limitations
● Server Heartbeat Load Grows Linearly with the Node Count
● No Other Health Checks
● All Messages Are Handled by the Server Quorum
  – Not Suited to High Message Rates or Large Clusters


Future Improvements

● Limit the Evacuation Rate and Count
● Add More Health Check Items
● Configurable Action Matrix
● Reserve Bandwidth for Gossip Messages
  – Gossip Traffic Calculation: 5 Messages/s to 3 Nodes ≈ 41 KB/s
  – 99% Convergence Takes 1.25 s in a 30-Node Cluster
● Watchdog Device

Nova Considerations


Nova Considerations: Force Service Down

PUT /v2.1/{tenant_id}/os-services/force-down

{
    "host": "HostA",
    "binary": "nova-compute",
    "forced_down": true
}

Service API

Enabled in Liberty with Microversion 2.11
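
A hedged sketch of calling it; the endpoint and microversion header are standard Nova API, while the URL, token and host name are illustrative:

    # Mark a compute service as forced down (microversion 2.11).
    # URL, token and host name are illustrative.
    curl -X PUT \
        http://controller:8774/v2.1/${TENANT_ID}/os-services/force-down \
        -H "X-Auth-Token: ${TOKEN}" \
        -H "X-OpenStack-Nova-API-Version: 2.11" \
        -H "Content-Type: application/json" \
        -d '{"host": "HostA", "binary": "nova-compute", "forced_down": true}'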


Nova Considerations: Evacuate Instance

POST /v2/{tenant_id}/servers/{server_id}/action

{
    "evacuate": {
        "host": "hostA",
        "onSharedStorage": "False"
    }
}

Evacuate API

As the API evolved, the request body shrank:

POST /v2/{tenant_id}/servers/{server_id}/action

{
    "evacuate": {}
}

● "host" Became Optional in Juno
● Persisting the Scheduling Policy: in Progress
● "onSharedStorage" Optional: in Progress
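
A hedged sketch of the call as of Liberty, where onSharedStorage is still required; the URL, token and server ID are illustrative:

    # Evacuate one instance off its failed host; Nova's scheduler picks
    # the target when "host" is omitted. URL, token and ID illustrative.
    curl -X POST \
        http://controller:8774/v2/${TENANT_ID}/servers/${SERVER_ID}/action \
        -H "X-Auth-Token: ${TOKEN}" \
        -H "Content-Type: application/json" \
        -d '{"evacuate": {"onSharedStorage": "False"}}'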

Ceilometer Considerations


High Availability Based on Ceilometer

Two Ceilometer Metrics
● Ping Check Metric: instance.ping.delay
  – delay <= timeout: return delay
  – delay > timeout: return 999
● Storage Health: instance.disk.health
  – writable: return health (health = 0)
  – unwritable: return -1

Assumed HA Alarm Conditions, Evaluated over num Periods
● avg(delay) > 500: trigger alarm
● sum(health) < 0: trigger alarm

[Data flow diagram: on the compute node, ceilometer-compute-agent performs (1) a ping check against the default gateway and (2) a write check on the disk through qemu-ga in the VM; samples reach ceilometer-collector on the Ceilometer node via a VIP, land in the database / Gnocchi, and alarms send notifications by Email, SMS or REST API.]

Two Ceilometer Alarms: ping-delay-alarm and disk-health-alarm

Alarm Actions: Email, SMS and REST API

REST API as the HA Handler
● Nova Live Migrate
● Nova Reboot API
● Nova Rebuild API

ping-delay-alarm: --meter-name instance.ping.delay --threshold 500 --comparison-operator gt --statistic avg --period 10 --evaluation-periods 3

disk-health-alarm: --meter-name disk.health --threshold 0 --comparison-operator lt --statistic avg --period 10 --evaluation-periods 3
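
A hedged sketch of creating the first alarm with the Liberty-era ceilometer CLI; the command and flags are the standard alarm-threshold-create interface, while the alarm-action URL is illustrative:

    # Create the ping-delay alarm: when the average delay over three
    # 10-second periods exceeds 500, POST to the HA handler's REST API
    # (the handler URL is illustrative).
    ceilometer alarm-threshold-create --name ping-delay-alarm \
        --meter-name instance.ping.delay --threshold 500 \
        --comparison-operator gt --statistic avg \
        --period 10 --evaluation-periods 3 \
        --alarm-action "http://ha-handler:8080/evacuate"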

Pros
● High Availability at the Virtual-Machine Level
● Covers Tenant Network and Storage Network Failures

Cons
● Does Not Handle Management Network or IPMI Network Failures
● Many Duplicate Checks when a Whole Host Fails
● Depends on the QEMU Guest Agent

Optimization: the Consul mechanism above overcomes these cons.

Demo

Q & A

Thanks
