Airflow @ Agari

  • View
    1.027

  • Download
    0

Embed Size (px)

Text of Airflow @ Agari

  • Apache Airflow @ Agari

    Sid Anand (@r39132) Apache Airflow Meet-up June 2016

    1

  • Agari

    2

    What We Do!

  • Agari : What We Do

    3

  • 4

    Agari : What We Do

  • 5

    Agari : What We Do

  • 6

    Agari : What We Do

  • 7

    Agari : What We Do

  • 8

    Enterprise Customers

    email metadata

    apply trust

    models

    email md + trust score

    Agaris Current Product

    Agari : What We Do

  • 9

    email metadata

    apply trust

    models

    email md+ trust score

    Agaris Future ProductEnterprise Customers

    Agari : What We Do

  • Apache Airflow @ AgariHow Do We Use It?

    10

  • 2 Classes of Orchestration

    11

    apply trust models

    (message scoring)

    build trust models

    cron++ (general

    job scheduler)

    New Product (Enterprise Protect)

    Operational Automation

  • 2 Classes of Orchestration

    12

    apply trust models

    (message scoring)

    build trust models

    cron++ (general

    job scheduler)

    New Product (Enterprise Protect)

    Operational Automation

    This Talk

  • Use-Case : Message ScoringBatch Pipeline Architecture

    13

  • Use-Case : Message Scoring

    14

    enterprise Aenterprise Benterprise C

    S3

    S3 uploads every 15 minutes

  • Use-Case : Message Scoring

    15

    enterprise Aenterprise Benterprise C

    S3

    Airflow kicks of a Spark message scoring job

    every hour

  • Use-Case : Message Scoring

    16

    enterprise Aenterprise Benterprise C

    S3

    Spark job writes scored messages and stats to

    another S3 bucket

    S3

  • Use-Case : Message Scoring

    17

    enterprise Aenterprise Benterprise C

    S3

    This triggers SNS/SQS messages events

    S3

    SNS

    SQS

  • Use-Case : Message Scoring

    18

    enterprise Aenterprise Benterprise C

    S3

    An Autoscale Group (ASG) of Importers spins up when it detects SQS

    messages

    S3

    SNS

    SQS

    Importers

    ASG

  • 19

    enterprise Aenterprise Benterprise C

    S3

    The importers rapidly ingest scored messages and aggregate statistics into

    the DB

    S3

    SNS

    SQS

    Importers

    ASGDB

    Use-Case : Message Scoring

  • 20

    enterprise Aenterprise Benterprise C

    S3

    Users can review untrusted emails in

    the web app

    S3

    SNS

    SQS

    Importers

    ASGDB

    Use-Case : Message Scoring

  • 21

    enterprise Aenterprise Benterprise C

    S3 S3

    SNS

    SQS

    Importers

    ASGDB

    Airflow manages the entire process

    Use-Case : Message Scoring

  • Use-Case : Message ScoringAirflow DAG

    22

  • 23

    Airflow DAG

  • 24

    5 minute wait for S3 eventual consistency

    Airflow DAG

  • 25

    1 hour a day, we also build new models

    Airflow DAG

  • 26

    build models

    Airflow DAG

  • 27

    dummy needed for branch operator

    Airflow DAG

  • 28

    trigger_rule: one_success

    ExternalTaskSensor : on the previous hour of the same DAG

    Airflow DAG

  • 29

    Prep Spark Run

    Airflow DAG

  • 30

    Run Spark Verify a record written

    to the DB Wait for the SQS

    queue to empty

    Airflow DAG

  • 31

    Compute discrepancies

    Send email report

    Update monitoring graphs

    Raise SLA (correctness) alerts

    Airflow DAG

  • SLAs & InsightsAirflow

    32

  • 33

    Desirable Qualities of a Resilient Data Pipeline

    OperabilityCorrectness

    Timeliness Cost

    Data Integrity (no loss, etc) Expected data distributions

    All output within time-bound SLAs (e.g. 1 hour)

    Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

    Quick Recoverability

    Pay-as-you-go

  • 34

    Desirable Qualities of a Resilient Data Pipeline

    OperabilityCorrectness

    Timeliness Cost

    Data Integrity (no loss, etc) Expected data distributions

    All output within time-bound SLAs (e.g. 1 hour)

    Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

    Quick Recoverability

    Pay-as-you-go

    SLA

    SLA

  • 35

    Correctness : Email Reporting

    For each org, we check for duplicate or missing data as a count & percentage

    orgs

  • 36

    Correctness : Email Reporting

    These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!

    orgs

  • 37

    Correctness : Monitoring

  • 38

    Airflow: And easy to integrate with Ops tools!

    Correctness & Timeliness : Alerting

  • 39

    Airflow: And easy to integrate with Ops tools!

    Correctness & Timeliness : Alerting

    Timeliness SLA miss

  • 40

    Airflow: And easy to integrate with Ops tools!

    Correctness & Timeliness : Alerting

    Timeliness SLA miss

    dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)

  • 41

    Airflow: And easy to integrate with Ops tools!

    Correctness & Timeliness : Alerting

    Timeliness SLA miss

    Correctness SLA miss

  • 42

    Airflow: And easy to integrate with Ops tools!

    Correctness & Timeliness : Alerting

    Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps

  • Operations

    43

    Managing Airflow

  • 44

    Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

    We have 3 environments (Dev, Stage, Prod), each with its own AWS account

    Agari in AWS

    Dev A Developers playground developers prototype new services here

  • 45

    Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

    We have 3 environments (Dev, Stage, Prod), each with its own AWS account

    Agari in AWS

    Dev A Developers playground developers prototype new services here

    Stage A Pre-Production Env

    All new code & system upgrades are baked here prior to Prod deployment

    It gets a copy of prod data & stage data : helps us test for scale!

  • 46

    Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

    We have 3 environments (Dev, Stage, Prod), each with its own AWS account

    Agari in AWS

    Dev A Developers playground developers prototype new services here

    Stage

    Prod Prod Env

    A Pre-Production Env All new code & system upgrades are baked here prior to Prod

    deployment It gets a copy of prod data & stage data : helps us test for scale!

  • 47

    Security

    Each environment has its own AWS account for isolation

    We use Security Groups & VPCs to lock every EC2 instance down

    In other words, EC2 nodes only accept inbound traffic from other hosts in the same or peered VPCs

    Exception : Each env has one bastion host The bastion host provides indirect access to the nodes in the VPC

    Agari in AWS

  • Agaris Airflow in AWS

    48

    Airflow DB

    Apache

    bastion host An Apache Reverse-proxy, running on the bastion host, does:

    SSH termination & Authentication via mod_auth

    sched

    workflow host

    web server

    SSL

    The workflow machine hosts airflow webserver airflow scheduler (running LocalExecutor)

    A Postgresql DB instance hosts the Airflow database

  • Fault Tolerance

    49

    Airflow DB

    Apache

    bastion host

    sched

    workflow-00

    web server

    SSL

    sched

    workflow-01

    At Agari, we use Monit to keep processes running on any EC2 instance, e.g. : Apache webserver on the bastion host Airflow webserver & scheduler on the workflow host Postgresql on the DB host

    In Prod, we run 2 schedulers for increased scalability and fault-tolerance

  • 50

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

  • 51

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    Deployed to stage & prod

  • 52

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    Agari / Airflow pinned to a release version (e.g. 1.6.2)

  • 53

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    r39132 / Airflow kept in sync upstream master

  • 54

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    r39132 / Airflow sends PR 191

  • 55

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    PR 191 is cherry-picked into Agari / Airflow

  • 56

    Airflow Github Repos

    Apache/ Airflow

    Agari/ Airflow (fork)

    r39132/ Airflow (fork)

    Deployed to stage & prod

    What is deployed is 1.6.2 + cherry-picked (e.g. PR 191) changes

  • Acknowledgments

    57

    Vidur Apparao Stephen Cattaneo Jon Chase Andrew Flury William Forrester Chris Haag Mike Jones