Big Data Workflow Scheduling System: An Introduction to Oozie and Related Products — 邱腾 Teng Qiu — http://abcn.net/ — http://www.fxlive.de/ — ChinaHadoop Open Course

Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study


DESCRIPTION

An Oozie introduction, case study, and tips, plus a short look at integrating Kettle and Oozie using Spoon. PDF download: http://user.cs.tu-berlin.de/~tqiu/Oozie_BigData_Workflow_Scheduler_Case_Study.pdf

During the past three years Oozie has become the de-facto workflow scheduling system for Hadoop, proving itself as a scalable, secure, and multi-tenant service.

More: http://www.chinahadoop.net/thread-6659-1-1.html
Online open course: http://chinahadoop.edusoho.cn/course/19
Video: http://www.youtube.com/watch?v=qzk08ggdIDw&hd=1 or http://vimeo.com/84164730


Page 1: Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study

Big Data Workflow Scheduling System: An Introduction to Oozie and Related Products

邱腾 Teng Qiu

http://abcn.net/

http://www.fxlive.de/

ChinaHadoop Open Course

Page 2

Agenda

● Oozie overview

● When to use Oozie

● How Oozie works and its characteristics

● Oozie core components

● Oozie in practice, with tips

● Oozie programming interfaces

● A first look at Kettle, a graphical open-source ETL tool that supports Oozie

● Summary and outlook

2 Berlin | 2014.01.14 | Teng Qiu

Page 3

OOZIE OVERVIEW

● A workflow engine

● Runs a set of Hadoop jobs in sequence

● Workflows are DAGs (Directed Acyclic Graphs)

● Workflow to Coordinator is 1:1; Coordinator to Bundle is n:1

● A Coordinator can be triggered by data events, or run like a cron job; time-based scheduling supports UTC only

● Workflows are described in XML using hPDL (Hadoop Process Definition Language)

● Similar to the jPDL used in JBoss jBPM

● Control flow nodes steer the execution path: start, end, fail / kill, decision, fork-join

● Action nodes: HDFS, MapReduce, Pig, Hive, Sqoop, Java, SSH, E-Mail, Sub-Workflow

● HDFS operations: mkdir, delete, move, chmod, touchz, DistCp

● State is stored in a database (Derby / MySQL)
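The node types above combine into a minimal hPDL workflow roughly like the sketch below; the workflow name, node names, and HDFS path are invented for illustration:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="prepare-dir"/>
    <action name="prepare-dir">
        <fs>
            <!-- an HDFS action node: create a working directory -->
            <mkdir path="${nameNode}/user/${wf:user()}/demo/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Every action node declares both an ok and an error transition, which is how the DAG expresses the control flow.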

Page 4

WHEN TO USE OOZIE

● Data processing workflows in Hadoop that must run in a fixed order

● Steps can run both sequentially and in parallel (fork-join)

● Notification and handling of results and failures

● ETL jobs inside the Hadoop cluster

● Replacing cron jobs inside the Hadoop cluster

Page 5

WHEN TO USE OOZIE

● Jobs that need to run periodically, such as ETL

● Tables in an RDBMS => HBase tables / Hive tables

● Triggers / stored procedures in an RDBMS => HBase RegionObserver and Endpoint coprocessors

A typical cron-based setup that Oozie replaces:

cron job A, on host hdp01, starts at minute 15 of every hour and processes raw data set 1

cron job B, on host hdp05, starts at minute 20 of every hour and processes raw data set 2

cron job C, on host hdp11, starts at minute 50 of every hour, reads the results of A and B, and processes them
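A coordinator replaces those scattered per-host crontab entries with a single hourly schedule; in the sketch below the app path, start/end times, and the idea of chaining A, B and C inside one workflow are assumptions for illustration:

```xml
<coordinator-app name="hourly-etl" frequency="${coord:hours(1)}"
                 start="2014-01-14T00:00Z" end="2015-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- one workflow chains A, B and C with a fork-join,
                 so job C no longer needs its own cron-minute offset -->
            <app-path>${nameNode}/user/etl/apps/hourly-etl</app-path>
        </workflow>
    </action>
</coordinator-app>
```

Note that start/end are given in UTC, matching the UTC-only limitation mentioned earlier.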

Page 6

WHEN TO USE OOZIE

● Data processing workflows in Hadoop that must run in a fixed order

● Steps can run both sequentially and in parallel (fork-join)

● Notification and handling of results and failures

● ETL jobs inside the Hadoop cluster

● Replacing cron jobs inside the Hadoop cluster

● Suited to batch processing and DWH workloads; not really an option for real-time data processing

Page 7

HOW OOZIE WORKS AND ITS CHARACTERISTICS

● How it works

● The launcher MR job is named following the pattern oozie:launcher:T=:W=:A= (action type, workflow name, action name)

● The Oozie server reads the workflow XML and submits a map-only MR job

● The map task wraps the user-defined action and submits job.jar and job.xml to the JobTracker via JobClient

● While the action job runs, the map-only launcher job waits => Oozie always occupies one extra map slot

● Action status is obtained via callback / polling

● Under normal conditions, completion is reported through the callback URL

● Characteristics

● Load balancing and fault tolerance / retry come for free from the MapReduce framework

● Supports parameterization via the Java EL language

● Within the DAG there is no retry on failure (Error / Exception / exit code != 0)

● But a workflow can be rerun (oozie.wf.rerun.failnodes=true, or oozie.wf.rerun.skip.nodes=xxx,yyy,zzz)
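The rerun options are passed as a properties file to `oozie job -rerun <job-id> -config rerun.properties`; a sketch, with the node names in the skip list as placeholders:

```properties
# rerun only the nodes that failed in the previous run
oozie.wf.rerun.failnodes=true

# ...or instead, rerun everything except the listed nodes
# (the two options are mutually exclusive)
#oozie.wf.rerun.skip.nodes=node-a,node-b
```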

Page 8

OOZIE CORE COMPONENTS: Control Flow Nodes

● Oozie core components (an introduction to the flow control nodes)

Page 9

OOZIE CORE COMPONENTS: Control Flow Nodes

● decision node: ${wf:conf("etl_only_do_something") eq "yes"}

● fork-join

● A bug: OOZIE-1142, fixed after 3.3.2

● Workaround: in oozie-site.xml, set oozie.validate.ForkJoin to false
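A decision node and a fork-join fit together roughly as in this workflow fragment; node names are invented, and the EL expression is the one from the bullet above:

```xml
<decision name="check-flag">
    <switch>
        <!-- route on a job property -->
        <case to="parallel-work">${wf:conf("etl_only_do_something") eq "yes"}</case>
        <default to="end"/>
    </switch>
</decision>

<fork name="parallel-work">
    <path start="aggregate-x"/>
    <path start="aggregate-y"/>
</fork>

<!-- both branches must transition to the same join node -->
<join name="merge" to="end"/>
```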

Page 10

OOZIE CORE COMPONENTS: Action Nodes

● HDFS

● move, delete, mkdir, chmod, touchz, DistCp

● MapReduce

● job.xml specifies the M/R classes and directories

● Pig / Hive

● <job-xml>hive-site.xml</job-xml>

● <script>${hiveScript}</script>

● SSH

● public keys required !!! (sigh...)

● <host>, <command>, <args> -_-

● Sub-Workflow

● <propagate-configuration/>
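Two of the action types above, sketched side by side; the node names and the staging path are placeholders:

```xml
<action name="prepare-dir">
    <fs>
        <delete path="${nameNode}/user/etl/staging"/>
        <mkdir path="${nameNode}/user/etl/staging"/>
    </fs>
    <ok to="run-hive"/>
    <error to="fail"/>
</action>

<action name="run-hive">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- every hive action needs its hive-site.xml; see the tips later -->
        <job-xml>hive-site.xml</job-xml>
        <script>${hiveScript}</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
</action>
```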

Page 11: Oozie in Practice - Big Data Workflow Scheduler - Oozie Case Study

● Sqoop Action 比较让人崩溃

11

OOZIE的核心组件 Action Node 任务节点

Berlin | 2014.01.14 | Teng Qiu

Page 12

OOZIE CORE COMPONENTS: Action Nodes

● Java action

● <main-class>

● <arg>

● <capture-output />

● ${wf:actionData('action-node-name')['property-name']}

// requires: java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties

// Oozie tells the launcher where to write captured output via this system property
String oozieProp = System.getProperty("oozie.action.output.properties");

if (oozieProp != null) {
    Properties props = new Properties();
    props.setProperty(propKey, propVal); // propKey/propVal: whatever the action wants to report
    File propFile = new File(oozieProp);
    OutputStream os = new FileOutputStream(propFile);
    props.store(os, "Results from oozie task");
    os.close();
}
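Wiring such a program into a workflow looks roughly like this; the class name and node names are invented:

```xml
<action name="init-action">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.etl.InitAction</main-class>
        <arg>${dateFrom}</arg>
        <!-- makes the written properties visible to later nodes -->
        <capture-output/>
    </java>
    <ok to="next-step"/>
    <error to="fail"/>
</action>
```

Downstream nodes can then read the captured values, e.g. in a decision: ${wf:actionData('init-action')['some.key'] eq 'yes'}.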

Page 13

OOZIE CORE COMPONENTS: Action Nodes

● Custom actions

● Subclass ActionExecutor

● Call super(ACTION_TYPE) in the constructor

● ActionExecutor.Context

● start / end / kill / check

● Edit oozie-site.xml

● Add the custom class name to the property oozie.service.ActionService.executor.ext.classes

● Maybe someone could write one for Impala?
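Registering such an executor in oozie-site.xml would look like this; the class name is hypothetical:

```xml
<property>
    <name>oozie.service.ActionService.executor.ext.classes</name>
    <!-- comma-separated list; append your executor to any existing entries -->
    <value>com.example.oozie.ImpalaActionExecutor</value>
</property>
```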

Page 14

OOZIE IN PRACTICE AND TIPS: Scenario

● Oozie in practice, with tips

● A typical DMP (Data Management Platform) ETL application

● Aggregate user behavior and classify users

● Intermediate tables — internal user classification: A -> 1,7,2,3,9,8 | B -> 4,3,2; internal-to-external user ID mapping: A -> A1 | B -> B1

User behavior tables 1..n (TTL = 30 days):

  User | Time | Item
  A    | 101  | XXX
  B    | 102  | YYY
  A    | 103  | ZZZ

Item category table:

  Item | Categories
  XXX  | 1,2,3
  YYY  | 4,3,2
  ZZZ  | 7,9,8

Final result:

  External user ID | Categories  | Generation
  A1               | 1,7,2,3,9,8 | 0
  B1               | 4,3,2       | 0

Page 15

OOZIE IN PRACTICE AND TIPS

[Reconstructed from the workflow diagram on this slide; the flow is approximately:]

START -> ZKClient getGen and checkTime
  1) get old and new generation
  2) compare lastImportedTime vs. lastExportedTime
  (on Error -> E-Mail Client, MSG: failed by ZK Client -> KILLED with ERROR)

-> decision: is there new data?
  No  -> E-Mail Client, MSG: nothing to export -> END (Successful)
  Yes -> fork:
          coprocessor Client: aggregate X events
          coprocessor Client: aggregate Y events
          coprocessor Client: aggregate Z events
        -> join
  (on Error -> ZKClient-fail-after-coproc: set generation back
            -> E-Mail Client, MSG: failed after coproc -> KILLED with ERROR)

-> Hive Script to generate export table

-> fork:
     Hive/FTP Script to create/send export files for A
     Hive/FTP Script to create/send export files for B
     Hive/FTP Script to create/send export files for C
   -> join

-> ZKClient-setGen: set new generation -> END (Successful)

Page 16

OOZIE IN PRACTICE AND TIPS: Actions Involved

● ETL scenario: DMP data aggregation, processing, and export

● decision / fork-join

● Java (HBase, ZooKeeper)

● Hive

● E-Mail

Page 17

OOZIE IN PRACTICE AND TIPS: Step One of the Long March, Running It

● Ways to use Oozie

● Command line

● Java client API / REST API

● Hue

● ShareLib

● /usr/lib/oozie/oozie-sharelib.tar.gz

● sudo -u oozie hadoop fs -put share /user/oozie/

● In job.properties: oozie.use.system.libpath=true

● oozie.service.WorkflowAppService.system.libpath

● oozie.libpath=${nameNode}/xxx/xxx/jars

$ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run

$ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W

$ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W

$ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000001-130104191423486-oozie-oozi-W

jobTracker=xxx:8021

nameNode=xxx:8020

oozie.coord.application.path=${workflowRoot}/coordinator.xml

oozie.wf.application.path=${workflowRoot}/workflow.xml
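Pulling the fragments above together, a complete job.properties might look like this sketch; host names and paths are placeholders, and normally only one of the two application-path properties is set per job:

```properties
nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8021
workflowRoot=${nameNode}/user/etl/apps/demo

# pick ONE, depending on whether you submit a workflow or a coordinator
oozie.wf.application.path=${workflowRoot}/workflow.xml
#oozie.coord.application.path=${workflowRoot}/coordinator.xml

# resolve shared jars from the Oozie ShareLib
oozie.use.system.libpath=true
```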

Page 18

OOZIE IN PRACTICE AND TIPS: It Will Not Run?

● Permission problems

● Error: E0902 : E0902: Exception occured: [org.apache.hadoop.ipc.RemoteException: User: oozie is not allowed to impersonate xxx]

● Set in core-site.xml:

● hadoop.proxyuser.oozie.groups

● hadoop.proxyuser.oozie.hosts

● The ForkJoin bug

● Error: E0735 : E0735: There was an invalid "error to" transition to node [xxx] while using fork/join

● OOZIE-1142

● In oozie-site.xml, set oozie.validate.ForkJoin to false
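The proxy-user fix, as it would appear in core-site.xml; the wildcard values are a common but permissive choice, so restrict them in production:

```xml
<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>
```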

Page 19

OOZIE IN PRACTICE AND TIPS: So Many Problems with HBase?

● hbase-site.xml

● Oozie has no built-in HBase support, so it knows nothing about HBase's ZooKeeper settings and the like

● If you are unlucky enough to need sqoop + hbase

● Look at the hbase-xxx.jar under /lib/sqoop/ in the sharelib

● Replace the hbase-site.xml inside the jar!?

● Or put hbase-site.xml via hadoop fs into oozie/share/lib/sqoop/

// load the HBase settings explicitly, since Oozie will not do it for you
Configuration conf = new Configuration();
conf.addResource("hbase-site.xml");
conf.reloadConfiguration();

Page 20

OOZIE IN PRACTICE AND TIPS: Hive Fails with All Kinds of Errors

● Every hive action node must point at hive-site.xml via <job-xml>

● FAILED: Error in metadata

● NestedThrowables: JDOFatalInternalException or InvocationTargetException

● Check the JDBC driver of the MetaStore database

● e.g. for MySQL, is mysql-connector-java-xxx-bin.jar in the workflow's lib directory?

● Directory permissions

● Hive's warehouse and tmp directories must be writable for the user that starts the Oozie job

● When integrating HBase

● Check the auxpath and ZooKeeper settings in hive-site.xml
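For the HBase integration point above, the relevant hive-site.xml entries look roughly like this sketch; the jar locations and ZooKeeper host names are placeholders that depend on your installation:

```xml
<property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/lib/hive/lib/hive-hbase-handler.jar,file:///usr/lib/hbase/hbase.jar</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1,zk2,zk3</value>
</property>
```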

Page 21

OOZIE IN PRACTICE AND TIPS: TIP, Global Properties

● global properties and job-xml

<workflow-app name="xxx">

<global>

<job-xml>${hiveSite}</job-xml>

<configuration>

<property>

<name>mapred.child.java.opts</name>

<value>-Xmx2048m</value>

</property>

<property>

<name>oozie.launcher.mapred.child.java.opts</name>

<value>-server -Xmx2G -Djava.net.preferIPv4Stack=true</value>

</property>

</configuration>

</global>

...

Page 22

OOZIE IN PRACTICE AND TIPS: TIP, Global Properties

● Parameter checking and substitution

<workflow-app name="">

<parameters>

<property>

<name>current_month</name>

</property>

<property>

<name>currentDate</name>

<value>${concat(concat("'", wf:conf('current_date')), "'")}</value>

</property>

<property>

<name>dateFrom</name>

<value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-01'))), "'")}</value>

</property>

<property>

<name>dateTo</name>

<value>${concat(concat("'", firstNotNull(wf:conf('current_date'), concat(wf:conf('current_month'), '-31'))), "'")}</value>

</property>

</parameters>

...

If the variable current_month is not given, the job fails with Error: E0738.

If the variable current_date is not given, currentDate resolves to '' here.

Page 23

OOZIE IN PRACTICE AND TIPS: Variable Names and Their Use

● Variables matching the naming rule ( [A-Za-z_][0-9A-Za-z_]* )

● ${xxx} or wf:conf(xxx)

● ${wf:conf("etl_only_do_something") eq "yes"}

● Variables outside the naming rule (e.g. input.path)

● ${input.path} still works

● But hyphens are not allowed

● You cannot write input-path

Page 24

OOZIE IN PRACTICE AND TIPS: Collecting KPI Values While the Workflow Runs

● MapReduce action / Pig action

● hadoop:counters

● ${hadoop:counters("mr-node-name")["FileSystemCounters"]["FILE_BYTES_READ"]}

● Java / SSH action

● <capture-output />

● ${wf:actionData('java-action-node-name')['property-name']}

● ${wf:actionData('ssh-action-node-name')['property-name']}

● Hive: no good way

● hive -e -S

Page 25

OOZIE IN PRACTICE AND TIPS: Passing Output Data from a Java Action Back to Oozie

● Using the Java program's output as variables

● <capture-output />

● Write a Properties file in the program

// requires: java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties

String oozieProp = System.getProperty("oozie.action.output.properties");

if (oozieProp != null) {
    Properties props = new Properties();
    props.setProperty("last.import.date", "2013-12-01T00:00:00Z"); // ISO-8601 date format
    File propFile = new File(oozieProp);
    OutputStream os = new FileOutputStream(propFile);
    props.store(os, "Results from oozie task");
    os.close();
}

Page 26

OOZIE IN PRACTICE AND TIPS: Using the Output Data of a Java Action

● Java

● The captured values can be passed to another action's main method as arguments

● Or used in a decision node

Page 27

OOZIE IN PRACTICE AND TIPS: Collecting Output Variables Has Its Risks

● The output data of an Oozie action has a default size limit of only 2K!

● Edit oozie-site.xml

● Raise the limit, e.g. to 1M

● And then... restart Oozie

Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]

Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null

org.apache.oozie.action.hadoop.LauncherException

at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)

<property>

<name>oozie.action.max.output.data</name>

<value>1048576</value>

</property>

Page 28

OOZIE PROGRAMMING INTERFACES

● Oozie programming interfaces

● Oozie Web Services API

● HTTP REST API

● curl -X POST -H "Content-Type: application/xml" -d @config.xml "http://localhost:11000/oozie/v1/jobs?action=start"

● Oozie Java client API

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
import java.util.Properties;

OozieClient oozieClient = new OozieClient(oozie_url);

// build the job configuration, as in job.properties (the HDFS path is a placeholder)
Properties conf = oozieClient.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/apps/demo");

// submit and start the workflow, then query its status
String jobId = oozieClient.run(conf);
WorkflowJob job = oozieClient.getJobInfo(jobId);

Page 29

KETTLE, A GRAPHICAL OPEN-SOURCE ETL TOOL

● Some limitations of Oozie

● Confined to the inside of the Hadoop cluster

● What about HBase?

● A first look at Kettle, which supports Oozie

● Job / Transformation

● HBase Input / Output

Page 30

SUMMARY AND OUTLOOK

● Summary and outlook

● An effective replacement for cron jobs inside a Hadoop cluster

● Tightly integrated with Hadoop; user permissions can be managed in one place

● Error alerting and handling (rerun) at the level of workflow nodes

● Flexible control over workflows through flow control nodes

● Compared with Azkaban, supports more kinds of tasks

● But at a cost: it permanently occupies one map slot

● Compared with Azkaban, supports variables and the EL language

● The coordinator offers an event-triggered start mode

● Rich APIs

● No HBase support

● And you have to write all that tedious XML