Airflow at WePay Chris Riccomini · June 14, 2016

Page 1: Airflow at WePay

Airflow at WePay
Chris Riccomini · June 14, 2016

Page 2: Airflow at WePay

Who?

Chris Riccomini
Engineer at WePay
Working on infrastructure (mostly)
Formerly LinkedIn, PayPal

Page 3: Airflow at WePay

Who?

Page 4: Airflow at WePay

Goal
• What do we use Airflow for?
• How do we operate Airflow?

Page 5: Airflow at WePay

Usage
• ETL
• Reporting
• Monitoring
• Machine learning
• CRON replacement

Page 7: Airflow at WePay

Usage
• 350 DAGs
• 7,000 DAG runs per day
• 20,000 task instances per day
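As a rough back-of-the-envelope reading of these numbers, the averages below are derived here rather than taken from the talk. They suggest many small, frequently run DAGs:

```python
# Figures from the slide above.
dags = 350
dag_runs_per_day = 7_000
task_instances_per_day = 20_000

# Derived averages (illustrative only; individual DAGs will vary widely).
runs_per_dag = dag_runs_per_day / dags
tasks_per_run = task_instances_per_day / dag_runs_per_day
tasks_per_minute = task_instances_per_day / (24 * 60)

print(f"{runs_per_dag:.0f} runs per DAG per day")
print(f"{tasks_per_run:.1f} task instances per DAG run")
print(f"{tasks_per_minute:.1f} task instances per minute")
```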

Page 8: Airflow at WePay

Environments

Page 9: Airflow at WePay

Airflow deployment
• Google Cloud Platform
• One n1-highcpu-32 machine (32 cores, 28 GB memory)
• Cloud SQL hosted MySQL 5.6 (250 GB, 12 GB used)
• Supervisord
• Icinga2 (investigating Sensu)

Page 10: Airflow at WePay

Airflow deployment

Page 11: Airflow at WePay

Airflow deployment

pip install git+https://[email protected]/DataInfra/[email protected]#egg=airflow[gcp_api,mysql,crypto]==1.7.1.2+wepay4

Page 12: Airflow at WePay

Airflow scheduler
• Single scheduler on same machine as webserver

executor = LocalExecutor
parallelism = 64
dag_concurrency = 64
max_active_runs_per_dag = 16
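The settings above live in `airflow.cfg`. As a minimal sketch, assuming they sit under the `[core]` section as in Airflow 1.x, the snippet below parses them with Python's standard `configparser` and notes what each limit bounds. The `describe_limits` helper is hypothetical and not part of Airflow.

```python
import configparser

# The values from the slide, written as an airflow.cfg fragment.
# Assumption: in Airflow 1.x these keys belong to the [core] section.
AIRFLOW_CFG = """
[core]
executor = LocalExecutor
parallelism = 64
dag_concurrency = 64
max_active_runs_per_dag = 16
"""


def describe_limits(cfg_text):
    """Hypothetical helper: parse the fragment and return the limits."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    core = parser["core"]
    return {
        "executor": core.get("executor"),
        # Max task instances running at once across the whole installation.
        "parallelism": core.getint("parallelism"),
        # Max task instances running at once within a single DAG.
        "dag_concurrency": core.getint("dag_concurrency"),
        # Max DAG runs active at once for a single DAG.
        "max_active_runs_per_dag": core.getint("max_active_runs_per_dag"),
    }


limits = describe_limits(AIRFLOW_CFG)
# With dag_concurrency equal to parallelism, one DAG may use every slot.
single_dag_can_fill_executor = limits["dag_concurrency"] >= limits["parallelism"]
print(limits, single_dag_can_fill_executor)
```

With these values a single busy DAG is allowed to occupy all 64 executor slots, which fits a deployment where one machine runs everything.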

Page 13: Airflow at WePay

Airflow logs
• /var/log/airflow
• Remote logger points to Google Cloud Storage
• Experimenting with ELK

Page 14: Airflow at WePay

Airflow connections

Page 15: Airflow at WePay

Airflow security
• Active Directory
• LDAP Airflow backend
• Disabled admin and data profiler tabs

Page 16: Airflow at WePay

DAG development

Page 17: Airflow at WePay

DAG development
1. Install gcloud
2. Run `gcloud auth login`
3. Install/start Airflow
4. Add a Google Cloud Platform connection (just set project_id)

Page 18: Airflow at WePay

DAG development

Page 19: Airflow at WePay

DAG development

Page 20: Airflow at WePay

DAG testing
• flake8
• Code coverage

Page 21: Airflow at WePay

DAG testing
• Test that the scheduler can import the DAG without a failure
• Check that the owner of every task is a known team
• Check that the email of every task is set to a known team
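The three checks above could be sketched roughly as follows. This is an illustration, not WePay's actual test suite: the team names, addresses, and the `validate_dags` function are hypothetical, and the sketch uses stand-in objects so it runs without Airflow installed. In a real test the inputs would presumably come from `airflow.models.DagBag`, whose `dags` and `import_errors` attributes have the same shape.

```python
from types import SimpleNamespace

# Hypothetical team names and addresses; replace with real ones.
KNOWN_OWNERS = {"data-infra", "risk"}
KNOWN_EMAILS = {"data-infra@example.com", "risk@example.com"}


def validate_dags(dags, import_errors,
                  known_owners=KNOWN_OWNERS, known_emails=KNOWN_EMAILS):
    """Return a list of human-readable problems; empty means all checks pass.

    `dags` maps dag_id -> object with a `.tasks` list, and `import_errors`
    maps file path -> error text (assumed to mirror DagBag's attributes).
    """
    problems = [f"{path}: import failed: {err}"
                for path, err in import_errors.items()]
    for dag_id, dag in dags.items():
        for task in dag.tasks:
            name = f"{dag_id}.{task.task_id}"
            if task.owner not in known_owners:
                problems.append(f"{name}: unknown owner {task.owner!r}")
            # A task's email may be a single address or a list of addresses.
            if isinstance(task.email, str):
                emails = [task.email]
            else:
                emails = list(task.email or [])
            if not emails or any(e not in known_emails for e in emails):
                problems.append(f"{name}: email not set to a known team")
    return problems


# Stand-in objects so the sketch runs without Airflow.
good = SimpleNamespace(task_id="load", owner="data-infra",
                       email="data-infra@example.com")
bad = SimpleNamespace(task_id="report", owner="someone", email=None)
example_dags = {"etl_users": SimpleNamespace(tasks=[good, bad])}

for problem in validate_dags(example_dags, import_errors={}):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a single test run report every offending task at once.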

Page 22: Airflow at WePay

DAG deployment
• CRON (ironically) that pulls from airflow-dags every two minutes

$ cat ~/refresh-dags
#!/bin/bash
git -C /etc/airflow/dags/dags/dev clean -f -d
git -C /etc/airflow/dags/dags/dev pull

• Webserver/scheduler restarts happen manually (right now)
• DAGs toggled off by default

Page 23: Airflow at WePay

DAG characteristics
• (Almost) all work happens off the Airflow machine
• Fairly homogeneous operator usage (GCP)
• Idempotent (re-run a DAG at any time)
• ETL DAGs are very small, but there are many of them
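One common way to get the idempotency mentioned above is for each run to overwrite the output for its own execution date instead of appending to it. The sketch below is an assumption about the general pattern, not WePay's actual code. The in-memory `warehouse` dict and the `load_partition` function are hypothetical stand-ins for a real table and load job.

```python
from datetime import date

# Hypothetical in-memory "table", keyed by partition (execution date).
warehouse = {}


def load_partition(table, execution_date, rows):
    """Replace the partition for `execution_date` with `rows`.

    Overwriting (rather than appending) means a re-run for the same
    date leaves the table in the same state as a single run.
    """
    table[execution_date] = list(rows)
    return table


rows = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 25}]
run_date = date(2016, 6, 14)

load_partition(warehouse, run_date, rows)
state_after_first_run = {k: list(v) for k, v in warehouse.items()}

# Re-running the same task for the same date changes nothing.
load_partition(warehouse, run_date, rows)
print(warehouse == state_after_first_run)
```

An append-based load would double the rows on the second run, which is what makes re-running a DAG "at any time" unsafe.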

Page 24: Airflow at WePay

Questions?
(We’re hiring)

Page 25: Airflow at WePay

Addendum

Page 26: Airflow at WePay

Usage