Airflow Clustering and High Availability By: Robert Sanders

Agenda

• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
  • Worker Nodes
  • Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure

Airflow Daemons

• Web Server
  • Daemon that runs the Airflow Webserver
  • One to many gunicorn processes accept and process requests in parallel
  • Lets you track job progress, run jobs, and more
• Scheduler
  • Runs periodically (every X seconds) to determine whether a DAG or Task needs to be run, based on the DAG schedule
  • Pushes messages to the Queuing Service to be executed
• Worker
  • Daemon that runs only if you're using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
  • One to many dedicated celeryd processes that execute functions
  • Pulls messages from the Queuing Service to determine which functions to execute
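
The Scheduler-to-Worker hand-off described above can be pictured with a toy model. This is only a sketch using Python's standard `queue` module as a stand-in for the Queuing Service; the function names `scheduler_tick` and `worker_drain` are hypothetical and are not Airflow or Celery APIs.

```python
import queue


def scheduler_tick(due_tasks, task_queue):
    """One scheduler pass: push a message for every task that is due."""
    for task_id in due_tasks:
        task_queue.put(task_id)


def worker_drain(name, task_queue, executed):
    """One worker pass: pull messages until the queue is empty and 'execute' them."""
    while True:
        try:
            task_id = task_queue.get_nowait()
        except queue.Empty:
            return
        # A real worker would run the task's function here.
        executed.append((name, task_id))


task_queue = queue.Queue()  # stands in for the Queuing Service
executed = []

scheduler_tick(["extract", "transform", "load"], task_queue)
worker_drain("worker-1", task_queue, executed)
```

The Scheduler never calls a Worker directly; the two sides only share the queue, which is why Workers can be added or removed without touching the Scheduler.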

Single Node Deployment

Cluster Deployment

Why set up a Cluster Deployment?

• Distributes heavy processes across many machines for better use of resources
• Provides a more highly available Airflow environment
• If you have many Workflows with many Tasks, your executors may not be able to get to all the messages in the queue; adding more executors fixes this

Scaling Workers

• Horizontally
  • Add more machines to the cluster
  • No need to register the machines with the master; just start the Airflow Worker on the new machine
• Vertically
  • Increase the number of executors (celeryd processes) per node and restart the workers
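
As a rough illustration of the two scaling directions, the arithmetic below compares total task slots. It assumes, for simplicity, that each celeryd process handles one task at a time; the helper `total_slots` and the node counts are made up for the example.

```python
def total_slots(nodes, processes_per_node):
    """Tasks the cluster can run at once, assuming one task per celeryd process."""
    return nodes * processes_per_node


baseline = total_slots(nodes=2, processes_per_node=4)
horizontal = total_slots(nodes=3, processes_per_node=4)  # one more machine
vertical = total_slots(nodes=2, processes_per_node=8)    # more processes per node
```

Vertical scaling is bounded by the CPU and memory of each node, while horizontal scaling is bounded mainly by how many machines you can add.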

Scaling Master

Limitations

• There can only be one Scheduler running at a time
  • If multiple Scheduler processes are running, multiple instances of a single task may be scheduled to run
• If the Scheduler daemon, or the machine it runs on, goes down, no jobs get scheduled
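
The duplicate-scheduling risk can be shown with a deliberately simplified model. It assumes two schedulers that enqueue due tasks with no coordination at all, which makes the duplicate certain rather than merely possible; `naive_schedule` is a hypothetical helper, not Airflow code.

```python
from collections import Counter


def naive_schedule(due_tasks, task_queue):
    """A scheduler pass with no coordination: enqueue everything that is due."""
    task_queue.extend(due_tasks)


due = ["daily_report"]
task_queue = []

naive_schedule(due, task_queue)  # scheduler A
naive_schedule(due, task_queue)  # scheduler B, unaware of A

# Any task that appears more than once would be executed more than once.
duplicates = {task: n for task, n in Counter(task_queue).items() if n > 1}
```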

Airflow Scheduler Failover Controller

• Dedicated daemon that runs alongside Airflow on the Master Nodes
• Ensures that there is always one and only one Scheduler running on the Master Nodes at a time
• Developed internally and open sourced
  • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High-level steps
  • Polls (every X seconds) to check if the Scheduler is running
  • If the Scheduler isn't running, restarts it
  • If it still doesn't start up, tries starting it on the other Master Nodes
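
The high-level steps above can be sketched as a single poll. This is not the project's actual code: `ensure_scheduler`, the `is_running` and `start` callbacks, and the `FakeCluster` test double are all hypothetical names introduced for illustration.

```python
def ensure_scheduler(nodes, current_node, is_running, start):
    """One poll: return the node the Scheduler runs on afterwards, or None."""
    if is_running(current_node):
        return current_node
    # First try to restart it where it was, then fall back to the other nodes.
    candidates = [current_node] + [n for n in nodes if n != current_node]
    for node in candidates:
        start(node)
        if is_running(node):
            return node
    return None


class FakeCluster:
    """Test double: starting the Scheduler on a 'broken' node has no effect."""

    def __init__(self, broken):
        self.broken = set(broken)
        self.running = set()

    def is_running(self, node):
        return node in self.running

    def start(self, node):
        if node not in self.broken:
            self.running.add(node)


cluster = FakeCluster(broken={"master1"})
result = ensure_scheduler(
    ["master1", "master2"], "master1", cluster.is_running, cluster.start
)
```

A real controller would call something like this in a loop, sleeping X seconds between polls, and would reach the other nodes over the network rather than through in-process callbacks.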

Failover Controller Diagram

Start Up Scenario

Failover Controller Process (Start Up)

Diagram: Master Node 1 with Failover Controller (standby); Master Node 2 with Failover Controller (standby).

On startup, the processes start out in STANDBY

Failover Controller Process (Start Up)

Diagram: Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

The first one to enter data into the Metastore is elected as the active controller.
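
One common way to implement "first writer wins" is a single-row table whose primary key rejects a second insert. The sketch below uses an in-memory SQLite database as a stand-in for the Metastore; the table name `failover_state`, its schema, and `try_become_active` are assumptions for illustration and do not reflect the project's real schema.

```python
import sqlite3


def try_become_active(conn, node):
    """Return True if `node` wins the election by inserting the single row first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO failover_state (id, active_node) VALUES (1, ?)",
                (node,),
            )
        return True
    except sqlite3.IntegrityError:
        # The row already exists, so another controller is active.
        return False


conn = sqlite3.connect(":memory:")  # stand-in for the shared Metastore
conn.execute(
    "CREATE TABLE failover_state ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), "
    "active_node TEXT NOT NULL)"
)

results = {node: try_become_active(conn, node) for node in ["master1", "master2"]}
active = conn.execute("SELECT active_node FROM failover_state").fetchone()[0]
```

Letting the database enforce uniqueness means the controllers do not need to talk to each other to agree on who is active.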

Failover Controller Process (Start Up)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

The Failover Controller checks to see if the Scheduler is running, but it isn't.

Failover Controller Process (Start Up)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Failover Controller starts up the Scheduler

Scheduler Failure Scenario

Failover Controller Process (Process Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Scheduler process has died

Failover Controller Process (Process Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Failover Controller restarts the Scheduler

Scheduler Failure and Failed Restart Scenario

Failover Controller Process (Process Failure 2)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Scheduler process has died

Failover Controller Process (Process Failure 2)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Failover Controller tries to restart the Scheduler, but it's still not running

Failover Controller Process (Process Failure 2)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Failover Controller tries to restart the Scheduler on a different node

Failover Controller Process (Process Failure 2)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Failover Controller succeeds in restarting the Scheduler, and the cluster is back to normal

Node Failure Scenario

Failover Controller Process (Node Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (active); Master Node 2 with Failover Controller (standby).

Everything is running as expected

Failover Controller Process (Node Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (standby).

Master Node 1 dies and all the processes running on it are gone

Failover Controller Process (Node Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active).

Failover Controller on Master Node 2 becomes active because the one running on Master Node 1 has stopped sending a heartbeat
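
A standby controller's takeover decision can be reduced to a staleness check on the last heartbeat it saw. The sketch below assumes timestamps in seconds and a configurable timeout; `should_take_over` and `timeout_seconds` are hypothetical names, and the numbers are example values only.

```python
def should_take_over(last_heartbeat, now, timeout_seconds):
    """True when the active controller's last heartbeat is older than the timeout."""
    return (now - last_heartbeat) > timeout_seconds


# Active controller last checked in at t=100s; the standby polls later.
still_alive = should_take_over(last_heartbeat=100.0, now=110.0, timeout_seconds=20.0)
gone_quiet = should_take_over(last_heartbeat=100.0, now=130.0, timeout_seconds=20.0)
```

The timeout should be comfortably longer than the heartbeat interval so that one delayed write does not trigger an unnecessary failover.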

Failover Controller Process (Node Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active).

The newly active Failover Controller tries to check in with and restart the Scheduler on the node where the Metastore says it is running, and fails.

Failover Controller Process (Node Failure)

Diagram: Scheduler (labeled twice); Master Node 1 with Failover Controller (dead); Master Node 2 with Failover Controller (active).

The Failover Controller then starts it on another node and it succeeds

Failover Controller Process (Node Failure)

Diagram: Scheduler; Master Node 1 with Failover Controller (standby); Master Node 2 with Failover Controller (active).

When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state

Q&A