Failover and takeover contingency mechanisms for network partition and node failure

Macías López, Laura M. Castro, David Cabrero

MADS Research Group – Universidade da Coruña (Spain)

Erlang Workshop, Copenhagen, 14th September 2012

Why are we (all) here?

Why are we (presenting this work) here?

concurrency!

high-availability!

distribution!

Unexpected problems after deployment!

node failures! system failure!

Outline

1 The system

2 The problems at deployment

3 The solution

4 Final remarks

The system: ADVERTISE

Distributed system for transmitting advertisements to set-top boxes (STBs) in customers' homes over the interactive digital TV (iDTV) network of a cable operator

The system: ADVERTISE's requirements

Ensure the appropriate coordination of advertising mechanisms:

- compilation of events
- emission of advertising signals to STBs during a period of time
- recording hits (displays) of a specific piece of advertisement

Major challenge: management of the size of the communications network, given the growing number of the operator's customers (approximately 100,000)

The system: ADVERTISE's architecture

The system: ADVERTISE's structure

The system: ADVERTISE as an Erlang distributed application

To meet its requirements, ADVERTISE was designed as a distributed application over several nodes

The problems at deployment: The symptoms

The ADVERTISE deployment environment presented some particularities that had not been foreseen:

- some nodes showed a tendency to fail more often than others
- network partition was common during some time periods (noon, night)

In this situation... fault tolerance requirements were not met!

The problems at deployment: The diagnosis

ADVERTISE was developed and tested over several physical machines

ADVERTISE was deployed over several virtual machines

- running on a single physical machine
- using a shared hard disk
- sharing the network link
- sharing with other apps/VMs

Frequent saturation of shared resources was perceived by ADVERTISE nodes as short network partitions.

The problems at deployment: The consequences

If nodes lose connectivity, believe that all the others are down, and take over system functions, inconsistencies are likely when connectivity is restored (duplicated responsibilities, data inconsistencies).

Perceived network partitions led to cascading failovers: duplicated registration of global names, random killing of conflicting processes, and overflow and eventual stop of the supervision mechanisms.
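The duplicated-registration symptom can be illustrated with a small simulation. This is a minimal Python sketch, not the ADVERTISE code: the `Partition` class, the global name `scheduler`, and the node names are all hypothetical. It models each side of a partition keeping its own view of the global name registry, so that both sides end up claiming the same name and the clash only surfaces when connectivity returns.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical model of one side of a network partition."""
    nodes: set
    names: dict = field(default_factory=dict)  # global name -> owning node

    def failover(self, name, node):
        # A node that cannot see the current owner assumes it is down
        # and registers the name for itself.
        if name not in self.names:
            self.names[name] = node


def merge(a, b):
    """Rejoin two partitions and report names claimed by both sides."""
    return {
        name: (a.names[name], b.names[name])
        for name in a.names.keys() & b.names.keys()
        if a.names[name] != b.names[name]
    }


# Three nodes; node1 owns the (hypothetical) global name "scheduler".
left = Partition(nodes={"node1"}, names={"scheduler": "node1"})
right = Partition(nodes={"node2", "node3"})

# node2 cannot reach node1, assumes it is down, and takes over.
right.failover("scheduler", "node2")

conflicts = merge(left, right)
print(conflicts)  # {'scheduler': ('node1', 'node2')}
```

Each such conflict has to be resolved by discarding one of the two owners, which is consistent with the random killing of conflicting processes described above.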

The solution

For ADVERTISE, data consistency was more important than availability:

- the system could not afford that advertising campaigns, rules, or media were lost or became inconsistent
- instead, it was acceptable that no ads were sent to STBs (or that they were delayed)

The solution

We re-designed ADVERTISE to be deployed over a minimum of 3 nodes, and never on an isolated node

The solution: ADVERTISE initialisation

The solution: ADVERTISE boot

The solution: Node integrity check

1 Retrieve the last known population of active nodes, List_actives

2 Retrieve the list of all ADVERTISE nodes from the configuration, List_all

3 Filter List_all, removing ping-unreachable nodes

If (filtered(List_all) ≠ List_actives) ∧ (|List_actives| = 1):

ADVERTISE is suspended immediately, and the node is rebooted once connectivity is restored
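The check above can be sketched in a few lines. This is an illustrative Python model under stated assumptions, not the authors' implementation: `node_integrity_check`, the injected `ping` callable, and the node names are hypothetical. It also assumes that the filtered list contains the local node itself, and that |List_actives| is measured on the refreshed population, so a size of 1 means the node can only see itself.

```python
def node_integrity_check(last_known_actives, all_nodes, ping):
    """Return (action, reachable) for the local node.

    ping is an injected callable node -> bool (an assumption made so the
    sketch can run without a real cluster).
    """
    # Step 3: drop every configured node that does not answer a ping.
    reachable = {n for n in all_nodes if ping(n)}

    # First half of the condition: the population has changed.
    changed = reachable != set(last_known_actives)
    # Second half (assumed reading): only the local node is left.
    isolated = len(reachable) == 1

    if changed and isolated:
        return "suspend", reachable
    return "continue", reachable


all_nodes = ["a@host1", "b@host2", "c@host3"]
local = "a@host1"

# Simulate a partition in which the local node can only reach itself.
action, actives = node_integrity_check(all_nodes, all_nodes, lambda n: n == local)
print(action)  # suspend
```

Suspending on isolation rather than continuing alone reflects the stated preference for consistency over availability.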

The solution: Distributed AC check

1 The distributed application controller (DAC) is queried on all nodes to get the PID of the ADVERTISE local supervisor

2 If ∃n ∈ List_all for which the ADVERTISE local supervisor PID could not be retrieved, node failure is assumed

   1 If n ∈ List_actives, the node replies to ping from the global supervisor but cannot reach the others; after a timeout:

      1 If n ∉ List_actives, node failure is confirmed

      2 If n ∈ List_actives, the node is up and we reboot it
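The decision procedure above can be modelled as follows. This is a minimal Python sketch under assumptions, not the ADVERTISE code: `dac_check`, `get_sup_pid` (a stand-in for the DAC query), `get_actives` (a stand-in for the current List_actives), and the verdict labels are all hypothetical. For a node that is not in List_actives at the first look, the sketch simply keeps the "assumed failed" status, since the slide does not spell out that branch.

```python
import time


def dac_check(all_nodes, get_sup_pid, get_actives, timeout_s=0.0):
    """Classify nodes whose local supervisor PID cannot be retrieved."""
    verdicts = {}
    for n in all_nodes:
        if get_sup_pid(n) is not None:
            continue  # step 1 succeeded for this node
        # Step 2: the PID could not be retrieved, so failure is assumed.
        verdicts[n] = "assumed_failed"
        if n in get_actives():
            # Step 2.1: the node still answers ping; wait, then look again.
            time.sleep(timeout_s)
            if n in get_actives():
                verdicts[n] = "reboot"            # 2.1.2: up, so reboot it
            else:
                verdicts[n] = "confirmed_failed"  # 2.1.1: gone after timeout
    return verdicts


all_nodes = ["a@h1", "b@h2", "c@h3"]
# Hypothetical DAC answers: only a@h1 returns a supervisor PID.
pids = {"a@h1": "<0.42.0>", "b@h2": None, "c@h3": None}

verdicts = dac_check(all_nodes, pids.get, lambda: {"a@h1", "b@h2"})
print(verdicts)  # {'b@h2': 'reboot', 'c@h3': 'assumed_failed'}
```

Passing the queries in as callables keeps the decision logic separate from the cluster calls, which makes each branch easy to exercise with scripted answers.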

The solution: Current ADVERTISE deployment

Cluster of 3 virtual nodes, handling an average of 18K STBs per node, with peaks of 23K STBs during prime time

Our tests reached a maximum of 45K STBs per node

System running with no incidents reported in the last 4 months

The most intensive advertising campaign was a 2-month promotion: displayed over 66 million times, with a peak of 140K times in 1 hour

An average campaign is displayed a total of 500K times, with peaks of up to 30K times in 1 hour during prime time on Saturday night
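The figures above allow a rough back-of-the-envelope reading of load and headroom. The short Python calculation below is only a sketch: it assumes that the 2-month promotion lasted 60 days and treats 66 million as a lower bound, and the variable names are illustrative.

```python
nodes = 3
avg_stbs_per_node = 18_000
peak_stbs_per_node = 23_000
tested_max_per_node = 45_000

# Average number of STBs served by the whole cluster.
cluster_avg = nodes * avg_stbs_per_node

# How far the tested maximum sits above the observed prime-time peak.
headroom = tested_max_per_node / peak_stbs_per_node

# Assumption: the 2-month promotion is taken as 60 days of 24 hours.
campaign_hours = 60 * 24
avg_displays_per_hour = 66_000_000 / campaign_hours
peak_to_avg = 140_000 / avg_displays_per_hour

print(cluster_avg)                    # 54000
print(round(headroom, 2))             # 1.96
print(round(avg_displays_per_hour))   # 45833
print(round(peak_to_avg, 2))          # 3.05
```

Under these assumptions, the tested capacity is roughly twice the observed per-node peak, and the busiest hour of the campaign carried about three times its average hourly load.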

Final remarks: Lessons learned

When designing a distributed Erlang app, one must take into account:

Network reliability

Latency of requests

Bandwidth

Network security

Network topology

Heterogeneity of components

Scalability

Final remarks: Your mileage may vary!

Had ADVERTISE requirements been substantially different, we would probably have favoured availability over consistency, for instance

And that would be a different story...

Questions?

Audience ! thanks

Some images and icons were downloaded from: openclipart.org
