August 2016 HUG: Recent development in Apache Oozie


Recent Development in Oozie

Purshotam Shah (purushah@yahoo-inc.com), Satish Saley (saley@yahoo-inc.com)

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Why Oozie?

• Out-of-the-box support for multiple job types: Java, shell, DistCp, MapReduce (pipes, streaming), Pig, Hive, Spark

• Highly scalable

• High availability: hot-hot with rolling upgrades, load balanced

• Hue integration

[Diagram: Oozie in the Hadoop ecosystem: HDFS, YARN, HBase, Pig, Hive, Spark, HCatalog, Hue]

Scale at Yahoo


• Deployed on all clusters (production and non-production); one instance per cluster

• 75 products / 2,000+ projects; 255 monthly users

• 90,000 workflow jobs daily (June 2016, one busy cluster); between 1 and 8 actions, avg. 4 actions per workflow; extreme use case: 100-200 workflow jobs submitted per minute

• 2,277 coordinator jobs daily (June 2016, one busy cluster); frequencies of 5, 10, 15 minutes, hourly, daily, weekly, monthly (25% under 15 minutes); 99% of workflow jobs are kicked off from a coordinator

• 97 bundle jobs daily (June 2016, one busy cluster)

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Data Pipelines

• Advertisement: Ad Exchange, Ad Latency, Search Advertising

• Content: Content Management, Content Optimization, Content Personalization, Flickr Video

• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting

Data Pipelines

• Anti Spam, Content, Retargeting

• Research, Dashboards & Reports, Forecasting

• Email Data Intelligence, Data Management

• Audience Pipeline

Use Case - Data pipeline


Oozie Coordinator

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>

Current limitations of the Oozie coordinator

• All datasets are required
• All instances are forced
• Datasets from multiple providers can't be combined
• There is no way to assign priority among datasets


Complex dependencies

OOZIE-1976: Specifying coordinator input datasets in more logical ways


Oozie Coordinator with input logic

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
  ...

BCP Support

Pull data from A or B: specify the dependency as AorB. The action starts running as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
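These operators can also be nested. The sketch below is illustrative only (the dataset names A, B, C and the particular nesting are assumptions, not from the original slides): it would require the primary feed A together with either backup B or C.

<!-- Illustrative sketch: require primary dataset A, plus either backup B or C -->
<input-logic>
  <and name="AandOneBackup">
    <data-in dataset="A"/>
    <or>
      <data-in dataset="B"/>
      <data-in dataset="C"/>
    </or>
  </and>
</input-logic>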


Minimum availability processing

Sometimes we want to start processing even if only partial data is available.

<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>


Optional feeds

Dataset B is optional: Oozie starts processing as soon as A is available, using all instances from A plus whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

Priority Among Dataset Instances

A takes precedence over B, and B takes precedence over C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>


Wait for primary

Sometimes we want to give preference to the primary data source and switch to the secondary one only after waiting for a specific amount of time.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>


Combining Datasets From Multiple Providers

The combine function first checks instances from A and then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>


Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

MiniOozie

MiniOozie with HCat, Pig, Hive, and Spark support

MiniOozieClient: used to communicate with the Oozie server.


Oozie Unit YAML

name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z
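For context, a hypothetical coordinator-test.xml matching the properties above could look roughly like the sketch below. The schema version, start/end window, hourly frequency, done-flag and instance offset are assumptions chosen to line up with the _SUCCESS paths touched in the YAML; they are not part of the original deck.

<coordinator-app name="test-coordinator" frequency="${coord:hours(1)}"
                 start="2010-02-01T10:00Z" end="2010-02-01T12:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- Hourly raw-log dataset; the _SUCCESS done-flag matches the touchz paths in the YAML above -->
    <dataset name="raw_logs" frequency="${coord:hours(1)}" initial-instance="2010-02-01T00:00Z" timezone="UTC">
      <uri-template>${raw_logs_path}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="raw_logs">
      <!-- Action @2 (nominal time 11:00Z) then waits on the 10:00 instance, which the test creates -->
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>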

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Spark Action


• Oozie native support for Apache Spark jobs

• Introduced last year in Apache Oozie 4.2.0

Example


<spark xmlns="uri:oozie:spark-action:0.2">

<master>yarn</master>

<mode>cluster</mode>

<name>Spark-FileCopy</name>

<class>org.apache.oozie.example.SparkFileCopy</class>

<jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>

<file> ${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt </file>

<archive> ${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>

<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>

<arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>

<arg>${nameNode}/${examplesRoot}/output-data/spark</arg>

</spark>

PySpark Example


Automatically sets up pyspark.zip and py4j-src.zip from Sharelib

<spark xmlns="uri:oozie:spark-action:0.2">

<master>yarn</master>

<mode>cluster</mode>

<name>PySparkExample</name>

<jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>

<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>

</spark>

Modes supported


• For local and yarn-client modes, the driver runs inside the Oozie launcher itself; therefore, any property intended for the driver should be prefixed with oozie.launcher.

• For example, oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts should be modified to increase driver memory (see the sketch below).
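As a minimal sketch (not from the original slides), these launcher properties could be set in the spark action's configuration block for yarn-client mode; the action name, class, jar and memory values below are illustrative.

<spark xmlns="uri:oozie:spark-action:0.2">
  <configuration>
    <!-- The driver runs in the Oozie launcher in client mode, so raise the launcher's memory (illustrative values) -->
    <property>
      <name>oozie.launcher.mapreduce.map.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <value>-Xmx3g</value>
    </property>
  </configuration>
  <master>yarn</master>
  <mode>client</mode>
  <name>DriverMemoryExample</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
</spark>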

Master / Mode

• local[*]
• yarn client
• yarn cluster

Recent enhancements


• Support for PySpark jobs

• Show Spark Job URLs in Oozie UI under Child Jobs Tab

• Automatically include spark-defaults.conf from Sharelib

• Support for <file> and <archive>

• Faster job launch time

• Simplified classpath setup

• Avoid re-uploading jars for localization by reusing HDFS paths in mapreduce.job.cache.files

• A couple of bug fixes

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Future Work

• Oozie unit testing framework: there are no unit tests today; pipelines are tested directly by running them in staging

• Coordinator dependency management: better reprocessing

• Aperiodic processing: currently managed through workarounds