Azkaban - WorkFlow Scheduler/Automation Engine

Preview:

DESCRIPTION

Azkaban - WorkFlow Scheduler/Automation Engine Seminar given at KPMG by Praveen Thirukonda.

Citation preview

Hadoop WorkFlow Scheduler / Automation

EngineAzkaban & Oozie

Praveen Thirukonda Senior Associate

Data & AnalyticsOrange County, CA

09/11/2014

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

2

What is a workflow?

- A workflow is a Directed Acyclic Graph (DAG) of “jobs” where each job has one or more inputs and outputs.

- A workflow scheduler helps us manage the co ordination among the various jobs.

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

3

When do we need a workflow scheduler?

- In a Data Pipeline, Batch jobs need to be scheduled to run periodically.

- They also typically have intricate dependency chains—for example, dependencies on various data extraction processes or previous steps.

- Larger processes might have 50 or 60 steps, of which some might run in parallel and others must wait for the output of earlier steps.

Azkaban

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

5

What is Azkaban?

- “cron on steroids”- A workflow scheduler can be seen as a

combination of the cron and make Unix utilities combined with a friendly UI.

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

6

What is Azkaban?

- Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies.

- Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

7

An Image is worth a 1000 words..

Apache Oozie

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

9

What is Apache Oozie?

- Similar to Azkaban. - Whereas Azkaban uses a series of

Properties files, Oozie uses an XML file.- Oozie supports Java API, command line

methods for workflow submission in addition to Browser interface/REST API.

- Oozie is part of our Hortonworks environment in our cluster.

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

10

Advantages of using a workflow scheduler

- Let’s you easily manage dependencies within the various tasks.

- Scheduling of workflows- Monitor the progress of your workflow with

nice interface.- Email alerts on failure and successes- Retrying of failed jobs.

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.

11

Application of a workflow scheduler

- Real Life example of how and where you might use a workflow scheduler in your Big Data System architecture?

Thank you

Presentation by Praveen Thirukonda

© 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved.

The KPMG name, logo and “cutting through complexity” are registered trademarks or trademarks of KPMG International.

Recommended