August 2016 HUG: Recent development in Apache Oozie


Recent Development in Oozie

Purshotam Shah (purushah@yahoo-inc.com), Satish Saley (saley@yahoo-inc.com)

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Why Oozie?

• Out-of-the-box support for multiple job types: Java, shell, DistCp, MapReduce (pipes, streaming), Pig, Hive, Spark

• Highly scalable

• High availability: hot-hot with rolling upgrades, load balanced

• Hue integration

[Diagram: Oozie in the Hadoop ecosystem: HDFS, YARN, HBase, Pig, Hive, Spark, HCatalog, Hue]

Scale at Yahoo


• Deployed on all clusters (production and non-production); one instance per cluster

• 75 products / 2,000+ projects; 255 monthly users

• 90,000 workflow jobs daily (June 2016, one busy cluster); between 1 and 8 actions, avg. 4 actions per workflow; extreme use case: 100-200 workflow jobs submitted per minute

• 2,277 coordinator jobs daily (June 2016, one busy cluster); frequencies of 5, 10, 15 minutes, hourly, daily, weekly, monthly (25% under 15 minutes); 99% of workflow jobs are kicked off from a coordinator

• 97 bundle jobs daily (June 2016, one busy cluster)

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Data Pipelines

• Advertisement: Ad Exchange, Ad Latency, Search Advertising

• Content: Content Management, Content Optimization, Content Personalization, Flickr Video

• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting

Data Pipelines

• Anti Spam, Content, Retargeting

• Research, Dashboards & Reports, Forecasting

• Email Data Intelligence, Data Management

• Audience Pipeline

Use Case - Data pipeline


Oozie Coordinator

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>

Current limitations of the Oozie coordinator

• All datasets are required
• All instances are forced
• Datasets from multiple providers can't be combined
• There is no way to assign priority among datasets


Complex dependencies

OOZIE-1976: Specifying coordinator input datasets in more logical ways


Oozie Coordinator with input logic

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
  ...

BCP Support

Pull data from A or B: specify the dependency as AorB. The action starts running as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
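These operators can also be nested. The sketch below is illustrative only (the dataset names A, B, C and the particular nesting are assumptions, not from the original slides): it would require the primary feed A together with either backup B or C.

<!-- Illustrative sketch: require primary dataset A, plus either backup B or C -->
<input-logic>
  <and name="AandOneBackup">
    <data-in dataset="A"/>
    <or>
      <data-in dataset="B"/>
      <data-in dataset="C"/>
    </or>
  </and>
</input-logic>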


Minimum availability processing

Sometimes we want to start processing even if only partial data is available.

<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>


Optional feeds

Dataset B is optional: Oozie starts processing as soon as A is available, using all instances from A plus whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

Priority Among Dataset Instances

A takes precedence over B, and B takes precedence over C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>


Wait for primary

Sometimes we want to give preference to the primary data source and switch to the secondary one only after waiting for a specific amount of time.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>


Combining Datasets From Multiple Providers

The combine function first checks instances from A and then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>


Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

MiniOozie

MiniOozie with HCat, Pig, Hive, and Spark support

MiniOozieClient: used to communicate with the Oozie server.


Oozie Unit YAML

name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z
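For context, a hypothetical coordinator-test.xml matching the properties above could look roughly like the sketch below. The schema version, start/end window, hourly frequency, done-flag and instance offset are assumptions chosen to line up with the _SUCCESS paths touched in the YAML; they are not part of the original deck.

<coordinator-app name="test-coordinator" frequency="${coord:hours(1)}"
                 start="2010-02-01T10:00Z" end="2010-02-01T12:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- Hourly raw-log dataset; the _SUCCESS done-flag matches the touchz paths in the YAML above -->
    <dataset name="raw_logs" frequency="${coord:hours(1)}" initial-instance="2010-02-01T00:00Z" timezone="UTC">
      <uri-template>${raw_logs_path}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="raw_logs">
      <!-- Action @2 (nominal time 11:00Z) then waits on the 10:00 instance, which the test creates -->
      <instance>${coord:current(-1)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>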

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Spark Action


• Oozie native support for Apache Spark jobs

• Introduced last year in Apache Oozie 4.2.0

Example


<spark xmlns="uri:oozie:spark-action:0.2">

<master>yarn</master>

<mode>cluster</mode>

<name>Spark-FileCopy</name>

<class>org.apache.oozie.example.SparkFileCopy</class>

<jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>

<file> ${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt </file>

<archive> ${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>

<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>

<arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>

<arg>${nameNode}/${examplesRoot}/output-data/spark</arg>

</spark>

PySpark Example


Automatically sets up pyspark.zip and py4j-src.zip from Sharelib

<spark xmlns="uri:oozie:spark-action:0.2">

<master>yarn</master>

<mode>cluster</mode>

<name>PySparkExample</name>

<jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>

<spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>

</spark>

Modes supported


• For local and yarn-client modes, the driver runs inside the Oozie launcher itself; therefore, any property intended for the driver should be prefixed with oozie.launcher.

• For example, oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts should be modified to increase driver memory (see the sketch below).
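As a minimal sketch (not from the original slides), these launcher properties could be set in the spark action's configuration block for yarn-client mode; the action name, class, jar and memory values below are illustrative.

<spark xmlns="uri:oozie:spark-action:0.2">
  <configuration>
    <!-- The driver runs in the Oozie launcher in client mode, so raise the launcher's memory (illustrative values) -->
    <property>
      <name>oozie.launcher.mapreduce.map.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>oozie.launcher.mapreduce.map.java.opts</name>
      <value>-Xmx3g</value>
    </property>
  </configuration>
  <master>yarn</master>
  <mode>client</mode>
  <name>DriverMemoryExample</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
</spark>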

Master / Mode

• local[*]
• yarn client
• yarn cluster

Recent enhancements


• Support for PySpark jobs

• Show Spark Job URLs in Oozie UI under Child Jobs Tab

• Automatically include spark-defaults.conf from Sharelib

• Support for <file> and <archive>

• Faster job launch time

• Simplified classpath setup

• Avoid re-uploading jars for localization by reusing HDFS paths in mapreduce.job.cache.files

• A couple of bug fixes

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Future Work

• Oozie unit testing framework: there are no unit tests today; pipelines are tested directly by running them in staging

• Coordinator dependency management: better reprocessing

• Aperiodic processing: currently managed through workarounds