
Building Data Pipelines in Python using Apache Airflow

STL Python Meetup Aug 2nd 2016 @conornash

What is Apache Airflow?

• Airflow is a platform to programmatically author, schedule and monitor workflows

• Designed for batch jobs, not for real-time data streams

• Originally developed at Airbnb by Maxime Beauchemin, now incubating as an Apache project

Why would you want to use it?

• Companies grow to have a complex network of processes and data with intricate dependencies

• Analytics & batch processing are becoming increasingly important

• Want to find a way to scale up analytics/batch processing while keeping time spent writing/monitoring/troubleshooting to a minimum

• Useful even for small workflows/batch jobs

Airflow Features

• Dependency management (DAGs)

• Status visibility

• Scheduling

• Log storage/retrieval

• Parameterized retries

• Distributed DAGs (RabbitMQ)

• Queues

• Pools

• Branching/Partial Success

• SLA monitoring

• Jinja templating

• Plugin system and more…
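
A minimal sketch (not from the talk; it assumes Airflow 1.x import paths, which later releases changed) of how several of these features surface in a DAG definition: task dependencies, a schedule, parameterized retries, an SLA, and Jinja templating in a BashOperator.

# Minimal sketch, not from the talk; dag_id and task_ids are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2016, 8, 1),
    "retries": 3,                         # parameterized retries
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),            # SLA monitoring
}

dag = DAG(
    dag_id="feature_demo",
    default_args=default_args,
    schedule_interval="@daily",           # scheduling
)

extract = BashOperator(
    task_id="extract",
    # Jinja templating: {{ ds }} is the execution date Airflow passes in
    bash_command="echo 'extracting data for {{ ds }}'",
    dag=dag,
)

load = BashOperator(
    task_id="load",
    bash_command="echo 'loading data for {{ ds }}'",
    dag=dag,
)

# Dependency management: load runs only after extract succeeds
load.set_upstream(extract)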

Airflow: Dashboard

Airflow: DAG

Quick start requirements

• Python 2 or 3

• Make a new project (virtualenv, pyenv, …)

• $ cd <project folder path> && export AIRFLOW_HOME=<project folder path>

• $ pip install airflow

• $ airflow initdb

• $ airflow webserver -p 8080
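
As a quick smoke test (not on the slide; file and DAG names below are hypothetical), a one-task DAG file saved under $AIRFLOW_HOME/dags/, the default DAGs folder, should appear in the web UI after a refresh; the scheduler ($ airflow scheduler) also has to be running for tasks to actually execute.

# $AIRFLOW_HOME/dags/hello_airflow.py  (hypothetical file name)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="hello_airflow",
    start_date=datetime(2016, 8, 1),
    schedule_interval="@daily",
)

BashOperator(
    task_id="say_hello",
    bash_command="echo 'hello from airflow'",
    dag=dag,
)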

Airflow: First DAG

• Existing Python/Bash/Java/etc. script that is difficult to monitor

• Probably already set up as a cron (Unix) or scheduled task (Windows)

• Want to integrate it into an Airflow DAG

Airflow: First DAG
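
A sketch of the pattern described above (not the code from the talk; the script path, schedule, and alert address are placeholders): leave the existing script untouched, point a BashOperator at it, and move the crontab schedule into schedule_interval so Airflow takes over retries, logging, and alerting.

# Hypothetical wrapper DAG; /path/to/nightly_report.sh and "0 6 * * *" are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "start_date": datetime(2016, 8, 1),
    "retries": 2,                         # retry instead of failing silently like cron
    "retry_delay": timedelta(minutes=10),
    "email": ["alerts@example.com"],      # hypothetical alert address
    "email_on_failure": True,
}

dag = DAG(
    dag_id="nightly_report",
    default_args=default_args,
    schedule_interval="0 6 * * *",        # the old crontab entry, unchanged
)

BashOperator(
    task_id="run_report",
    # Trailing space keeps Jinja from treating the .sh path as a template file
    # (a documented BashOperator quirk).
    bash_command="/path/to/nightly_report.sh ",
    dag=dag,
)

Once the file parses, the wrapped script gets the run history, logs, and retry behaviour from the feature list without any changes to the script itself.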

Airflow: Complex DAG

Why would you want to use it?

• Data Warehousing

• Anomaly Detection

• Search Ranking

• Model Training

• Text Analysis

• Experimentation (e.g. A/B tests)

• Data Cleaning

• 3rd Party Data Integration

Q&A

Twitter: @conornash

Email: conor@conornash.com
