丁来强 [email protected]


Efficient Big Data Workflow and Task Scheduling with Python

[email protected]

丁来强 (LaiQiang Ding)

About Me

• Father of a 4-year-old boy

• Worked for 10+ years
• @Splunk

Agenda

• Background
    • Definition
    • Role in Data Infra
• Requirement
    • Problem
    • Challenges
    • Requirement
• Solutions
    • Overview
    • Luigi
    • Airflow
• Demo

You will learn:

• The role of the workflow scheduler for data engineering in the ecosystem.
• Challenges and key requirements.
• Solutions and their general differences.
• Architecture, design and practices of using Airflow and Luigi in Python.
• Pitfalls and common patterns when designing around a workflow scheduler.


Definition

Big Data Workflow Scheduler

Schedules and manages the dependencies of workflows of jobs in the data infrastructure; mainly used in offline and near-line systems.
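
To make "managing dependencies of a workflow of jobs" concrete, here is a minimal, library-free sketch that derives a valid run order from a dependency map. The job names (`extract`, `clean`, `aggregate`, `publish`) are hypothetical and not taken from the talk; it uses Python's standard `graphlib` module (Python 3.9+), not Luigi or Airflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key lists the jobs it depends on.
dependencies = {
    "clean": {"extract"},
    "aggregate": {"extract"},
    "publish": {"clean", "aggregate"},
}


def execution_order(deps):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # "extract" comes first and "publish" last
```

A real scheduler adds calendar triggers, retries and monitoring on top of this ordering step.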

Big Data Work-flow Scheduler

Work-flow & Dependency

Jobs & Tasks

Big Data Systems

Scheduler

Different from the following categories:

• BPM
    • Like Activiti
• Middleware workflow & SOA
    • Like AWS Simple Workflow
• Pure data-driven pipeline / API for development
    • Like Apache Crunch, Apache Cascading, AWS Data Pipeline, Azure Data Factory
• Pure streaming process
    • Like Storm, Spark Streaming

Role in Data Infra

Hadoop 2.0

Airbnb Data Infra

LinkedIn Data Infra

http://www.slideshare.net/amywtang/linkedin-bigdata-yaleoct2012final

Data of workflow scheduler in Big Data

• 14 boxes dedicated to the work-flow system
• 8,000 tasks daily

• Maintains 3 instances of the work-flow system
• 2,500 flows, 30,000 jobs daily

• 2,000+ tasks, 10,000+ Hadoop jobs daily

Airflow

What’s most important for a Big Data workflow scheduler?

Dead simple:
- Easy to use and configure

Problems with big data job scheduling

Typical Challenge

• Data workflows keep growing in complexity
• Data analysis and batch data processing are very important
• A lot of time is spent writing jobs, monitoring them and troubleshooting


Fragile process

Job A

Job B

Job C

Job X

Push to Production

Push to QA

A/B Testing

Alert on failure

Example: Netflix Recommendation System

Fragile Failure Handling

1. Scheduled triggers are skipped because a sub-system is unavailable

2. Jobs fail because a system or the network is temporarily unavailable

3. Errors or bugs may exist in some jobs' logic

4. Performance is slow, especially for some critical steps
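
The second failure mode above (a temporarily unavailable system or network) is usually handled by retrying the job. The following is a minimal sketch of that idea in plain Python. It is not how Luigi or Airflow implement retries; `run_with_retries`, `flaky_job` and the linear back-off are assumptions made for illustration.

```python
import time


def run_with_retries(job, retries=3, delay_seconds=0.0, sleep=time.sleep):
    """Call job(); on an exception retry up to `retries` more times, then re-raise."""
    attempt = 0
    while True:
        try:
            return job()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the failure (e.g. to alerting)
            attempt += 1
            sleep(delay_seconds * attempt)  # linear back-off between attempts


calls = {"count": 0}


def flaky_job():
    """Hypothetical job that fails twice (sub-system down) and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("sub-system temporarily unavailable")
    return "done"


result = run_with_retries(flaky_job, retries=3)
```

Retries only help with transient failures; logic bugs (item 3) fail on every attempt and need alerting instead.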

Requirement

Basic Needs

• Failure Tolerance & Backfill
• Calendar Based Scheduling
• Log Access & Monitoring
• Notification
• Work-flow & Dependency
• Job Definition

Advanced Needs

• Scalability
• SLA Monitor & Alert
• Complex Rule
• Operator OOB
• High Availability
• Programmatic

Advanced Needs (cont’d)

• Queue (Affinity)
• Pool (Limit concurrency + priority)
• Data Profiling
• Plugins
• Versioning
• Event Driven Scheduler

Solution Overview

Options

Airflow

Solution Overview

Basic Info       Luigi      Airflow            Azkaban    Oozie
Language         Python     Python             Java       Java
GitHub Stars     5,274      3,422              780        354
Contributors     256        178                37         18
Latest Version   2.3.1      1.7.1              3.1        4.2
History          4 years    1+ years           6+ years   6+ years
Invented by      Spotify    Airbnb             LinkedIn   Yahoo
Owned by         Spotify    Apache Incubator   Apache     Apache

Azkaban

• Pros:
    • Born for Hadoop
    • Supports all Hadoop, Hive and Pig versions
    • Easy-to-use Web UI:
        • Good job visualization and monitoring
    • Flexible module structure / plugins
• Cons:
    • Properties-file-based configuration
    • Web UI only, no CLI or REST interfaces (needs the 3rd-party Azkaban CLI)
    • Limited execution path control

Azkaban GUI

Oozie

• Pros:
    • Born for Hadoop
    • CLI, HTTP and Java API interfaces
    • Supports extended alert integration
• Cons:
    • Higher learning curve
    • PDL-style XML-based configuration
    • Limited Web UI (needs Cloudera Hue)
    • No resource control

Luigi

Overview

• Pros:
    • Programmatic, in Python
    • Modeling is simple, code is mature (~20K LOC)
    • Good Hadoop support (MR, logs, dist)
    • Test friendly, supports a local scheduler
• Cons:
    • Web UI is very limited
    • No built-in trigger (needs cron)
    • Not designed for large scale (>100K tasks)
    • No support for distributed execution

Task Definition

Output of the Task: return one or more Targets

Set up dependencies: return one or more Tasks

Logic
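
The three parts of a task named above (dependencies, output, logic) can be illustrated with a small stand-in written in plain Python. This is a sketch of the contract only, not the real `luigi.Task` class: the `Task` base class, the `build` runner and the `Extract` / `CountWords` tasks are all hypothetical, and outputs are simplified to local file paths.

```python
import os
import tempfile


class Task:
    """Tiny stand-in for a Luigi-style task; NOT the real luigi.Task."""

    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the task's logic

    def complete(self):
        return os.path.exists(self.output())  # done once the output exists


def build(task, log):
    """Run dependencies first and skip any task whose output already exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()


class Extract(Task):
    def output(self):
        return os.path.join(workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("a b c\n")


class CountWords(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(workdir, "count.txt")

    def run(self):
        with open(Extract().output()) as f:
            words = f.read().split()
        with open(self.output(), "w") as f:
            f.write(str(len(words)))


log = []
build(CountWords(), log)  # runs Extract, then CountWords
build(CountWords(), log)  # no-op: the output already exists
```

Because completion is derived from the output, running the same build twice does no extra work, which is the de-duplication behaviour the deck attributes to Luigi.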

Task Example

Task Execution

Architecture

Also a web server

Architecture Notes

• Mainly manages dependencies and de-duplicates running tasks.
• Mainly focuses on data pipeline ETL.
• Limitations
    • No calendar trigger
    • Web UI is very limited
    • Worker and scheduler are too tightly coupled (no support for >100K tasks)
    • Execution is bound to a specific worker

Web UI – execution status

Web UI – DAG visualization

Task and Targets Library

• Google BigQuery
• Hadoop jobs
• Hive queries
• Pig queries
• Scalding jobs
• Spark jobs
• PostgreSQL, Redshift, MySQL tables
• and more…

Airflow

Overview

• Pros (we will see):
    • More general, flexible architecture
    • Very compelling Web UI
    • Lots of cool features OOB, rich Operator library
    • Fast-growing adoption (30+ companies)
    • Test friendly (test mode and Sequential Scheduler)
• Cons:
    • Code quality is not yet mature (unit-test coverage is not high)
    • No event-driven scheduler (same as all the other solutions)

Airflow Tech Stack

• Python code (<20K LOC)
• DB: SQLAlchemy
• Celery for distributed execution
• Web server: Flask / gunicorn
• UI: d3.js / Highcharts / Pandas
• Templating: Jinja2

Airflow Web

Log Access & Monitoring

Work-flow & Dependency

Data Profiling

Web UI – overall status

Web UI – workflow visualization

Web UI – execution history

Web UI – performance profile

Web UI – performance stats over time

Web UI – deep dive into task execution

Airflow Concepts

Work-flow & Dependency

Operator OOB

Programmatic

DAG (Directed Acyclic Graph)

DAG: a collection of tasks w/ scheduling settings

Task: an instance of BashOperator; supports templating

Set up the dependencies

A task of another kind: a PythonOperator

DAG execution

[Figure: Dag1 (Task1, Task2, Task3) produces one Dag1 Run per scheduled date (2016-9-1, 2016-9-2, 2016-9-3); each run holds the Task Instances for that date. Tasks are built from operators (HiveOperator, PigOperator, PythonOperator), which use hooks (HiveHook, PigHook).]

Concepts – DAG, DAG Run

• DAG
    • A collection of Tasks
    • Calendar scheduling settings
• DagRun
    • A run instance of a DAG with a scheduled date (ID: dag, start time and interval)

Concepts – Operator, Task and TI

• Operator
    • Task templates
• Task
    • Instance of an Operator
• Task Instance (TI)
    • Belongs to a DagRun
    • A run instance of a Task with a scheduled date (ID: dag, task, start time and interval)
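
The relationship between Task and Task Instance can be sketched as a small data model. This is an illustration in plain Python dataclasses, not Airflow's own classes; the field names and the simplification of the instance ID to `(dag_id, task_id, execution_date)` are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Task:
    dag_id: str
    task_id: str
    operator: str  # name of the operator (task template) this task instantiates


@dataclass(frozen=True)
class TaskInstance:
    task: Task
    execution_date: datetime  # the scheduled date of the owning DagRun

    @property
    def key(self):
        """Identity of one run of one task: dag, task and scheduled date."""
        return (self.task.dag_id, self.task.task_id, self.execution_date)


def instances_for_run(tasks, execution_date):
    """Create the task instances belonging to one DagRun."""
    return [TaskInstance(task, execution_date) for task in tasks]


tasks = [
    Task("dag1", "task1", "HiveOperator"),
    Task("dag1", "task2", "PigOperator"),
]
run_1 = instances_for_run(tasks, datetime(2016, 9, 1))
run_2 = instances_for_run(tasks, datetime(2016, 9, 2))
```

The same two tasks yield four distinct task instances across the two runs, each with its own state and history.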

Concepts – Operator

• Operator
    • Task templates, general categories:
        • Sensor
        • Branching
        • Transformer
    • Settings for trigger rules, retries etc.
    • Uses Hooks for the actual operations on external systems

Operator Library

• Google BigQuery, Cloud Storage
• AWS S3, EMR
• Spark SQL
• Docker
• Presto
• Sqoop
• Hive jobs
• Vertica
• Qubole
• SSH
• HipChat, Slack, Email
• PostgreSQL, Redshift, MySQL, Oracle etc.
• and more…

Parameterized Tasks

• Variables
    • Global parameters
• Connections
    • External systems' connection strings, confidential data, extra parameters etc. Normally used by Hooks.
• DAG Parameters / Macros
• Templating
    • Using Jinja for batch or any place that fits
• XCom
    • Share data between Tasks

Architecture Insight

Scalability

High Availability

Versioning

Airflow Architecture (Local Scheduler)

[Diagram: Scheduler, Workers, Web Servers and Metadata Database; the workers invoke Hive, HDFS, MySQL, Cascading, Spark and Presto.]

Local Scheduler – w/ version control

[Diagram: the Local Scheduler layout plus a Master Repo and a Code Repo.]

Airflow Architecture (Celery Scheduler)

[Diagram: Scheduler, Workers, Web Servers and Metadata Database, plus Brokers (MQ) and a State Store; the workers invoke Hive, HDFS, MySQL, Cascading, Spark and Presto.]

Celery Scheduler – w/ version control

[Diagram: the Celery Scheduler layout plus a Master Repo and a Code Repo.]

Airflow Architecture – HA

[Diagram: the Celery Scheduler layout plus a Scheduler Slave.]

Workflow Patterns

Complex Rule

Process in parallel

Switch

Sub DAG

• Easier to control, re-use and test
• Just like a component in code

Trigger Rule – all success

Triggered by “all_success”

Trigger Rule – one success

Triggered by “one_success”
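
The difference between the two trigger rules can be expressed as a small predicate over the states of a task's upstream tasks. This is a plain-Python sketch, not Airflow's implementation; the function name `should_trigger` and the state strings (`"success"`, `"failed"`, `"running"`) are assumptions for illustration.

```python
def should_trigger(rule, upstream_states):
    """Decide whether a task may start, given the states of its upstream tasks."""
    if rule == "all_success":
        # every upstream task must have succeeded
        return all(state == "success" for state in upstream_states)
    if rule == "one_success":
        # a single successful upstream task is enough
        return any(state == "success" for state in upstream_states)
    raise ValueError(f"unsupported trigger rule: {rule}")


print(should_trigger("all_success", ["success", "failed"]))  # False
print(should_trigger("one_success", ["success", "failed"]))  # True
```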

Scheduling Practice

Calendar Based Scheduling

Calendar-based scheduling (UTC)

Time zone is always UTC

Calendar

Scheduler – interval in workflow

Note: Every DagRun only starts once the next DagRun’s execution time is reached.

Run at 04:00:00
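
The note above means a run labelled with a given execution time actually starts one interval later. A minimal arithmetic sketch follows, assuming a 4-hour interval (the interval is not stated on the slide; it is inferred from the "Run at 04:00:00" label) and a hypothetical helper named `earliest_start`.

```python
from datetime import datetime, timedelta


def earliest_start(execution_date, schedule_interval):
    """A DagRun can start only once the next run's execution time is reached."""
    return execution_date + schedule_interval


interval = timedelta(hours=4)  # assumed interval
first_run = datetime(2016, 9, 1, 0, 0)
print(earliest_start(first_run, interval))  # 2016-09-01 04:00:00
```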

Scheduler – recursive running

Run1 (09-01 00:00)

Run2 (09-01 04:00)

What if output_file in Run1 and Run2 impact each other?

Idea: Try your best to avoid this kind of design

Principles when defining a Task

Atomic: each Task should be atomic
- Isolated from concurrent processing
- Either succeeds or fails, no grey state
- Failure will not impact the system

Granularity: Task granularity should be proper
- Choose the “right size” for one task
- Tasks should be able to execute simultaneously

Idempotent Task

Clean up the environment on failure
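
One common way to get an idempotent task that cleans up after a failure is to write to a temporary file and rename it into place only on success. The sketch below shows that pattern with the standard library; it is an assumption about how such a task could be written, not code from the talk, and `write_output_atomically`, `good_lines` and `bad_lines` are hypothetical names.

```python
import os
import tempfile


def write_output_atomically(path, produce_lines):
    """Write to a temp file and move it into place only if everything succeeded."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in produce_lines():
                tmp.write(line + "\n")
        os.replace(tmp_path, path)  # swap in the finished file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial file on failure
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "output_file.txt")


def good_lines():
    return ["a", "b"]


def bad_lines():
    yield "partial"
    raise RuntimeError("job failed midway")


write_output_atomically(target, good_lines)
write_output_atomically(target, good_lines)  # re-running gives the same result
try:
    write_output_atomically(target, bad_lines)  # fails and leaves no debris
except RuntimeError:
    pass
```

After the failed run the previous good output is still in place and no partial file remains, so the task can simply be retried.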

That’s the ideal case; in real cases…

Scheduler – recursive dependency

Run1 (09-01 00:00)

Run2 (09-01 04:00)

What if read_file in Run1 and Run2 cannot run in parallel due to an external system’s limitation?

Option 1: Assign a pool with 1 slot to task read_file

Option 2: Turn on the option “depends_on_past” for task read_file
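
Option 1 amounts to a counting semaphore with a single slot: however many runs are ready, only one `read_file` executes at a time. The sketch below demonstrates the idea with threads and the standard library. It is not Airflow's pool implementation, and the `stats` bookkeeping exists only to make the effect observable.

```python
import threading
import time

pool = threading.Semaphore(1)  # a "pool" with a single slot
lock = threading.Lock()
stats = {"active": 0, "max_active": 0}
finished = []


def read_file(run_name):
    """Hypothetical task body; at most one copy may run at a time."""
    with pool:  # wait for the free slot
        with lock:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
        time.sleep(0.01)  # pretend to talk to the external system
        with lock:
            stats["active"] -= 1
            finished.append(run_name)


threads = [threading.Thread(target=read_file, args=(name,))
           for name in ("run1", "run2", "run3")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

All three runs finish, but `stats["max_active"]` never exceeds 1. Unlike “depends_on_past”, a pool does not enforce any order between the runs.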

Resource Control

• Queue (Affinity)
• Pool (Limit concurrency + priority)

Scheduler – more recursive dependency

Run1 (09-01 00:00)

Run2 (09-01 04:00)

What if read_file in Run2 relies on output_file in Run1 due to a restriction or a necessarily stateful design?

Option: Turn on the option “wait_for_downstream” for task read_file (this forces “depends_on_past” to be turned on)
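
The two options can be read as extra start conditions on a task instance. The following plain-Python sketch models them; it is not Airflow's scheduler code. The `can_start` function, the integer run index and the assumption that `output_file` is directly downstream of `read_file` are all hypothetical.

```python
def can_start(task_id, run_index, states, downstream,
              depends_on_past=False, wait_for_downstream=False):
    """Check the cross-run conditions for one task instance.

    states maps (task_id, run_index) to a state string such as "success".
    downstream maps a task_id to the tasks directly downstream of it.
    """
    if wait_for_downstream:
        depends_on_past = True  # as the slide notes, this option forces depends_on_past
    if run_index == 0 or not depends_on_past:
        return True  # first run, or no dependency on the previous run
    previous = run_index - 1
    if states.get((task_id, previous)) != "success":
        return False  # the same task in the previous run has not succeeded
    if wait_for_downstream:
        # the previous run's directly downstream tasks must have succeeded too
        return all(states.get((d, previous)) == "success"
                   for d in downstream.get(task_id, []))
    return True


downstream = {"read_file": ["output_file"]}
states = {("read_file", 0): "success", ("output_file", 0): "running"}

only_past = can_start("read_file", 1, states, downstream, depends_on_past=True)
with_downstream = can_start("read_file", 1, states, downstream,
                            wait_for_downstream=True)
```

Here `only_past` is true because Run1's `read_file` succeeded, while `with_downstream` is false until Run1's `output_file` has also succeeded.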

Scheduler – recursive dependency pitfall

start_date and schedule_interval should be aligned

2016-09-08 00:00:00 is aligned
2016-09-08 02:00:00 is NOT aligned

Note: A misaligned start_date will make the DAG fail if the option “depends_on_past” is turned on

‘@once’: run just one time
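
A quick way to check the alignment rule before deploying a DAG is to test whether the start date falls on a multiple of the interval. The sketch below assumes a 4-hour interval and counts multiples from midnight; both are assumptions, since the slide does not state the interval or define alignment precisely, and `is_aligned` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def is_aligned(start_date, schedule_interval):
    """True if start_date is a whole number of intervals after midnight."""
    midnight = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return (start_date - midnight) % schedule_interval == timedelta(0)


interval = timedelta(hours=4)  # assumed interval
print(is_aligned(datetime(2016, 9, 8, 0, 0, 0), interval))  # True
print(is_aligned(datetime(2016, 9, 8, 2, 0, 0), interval))  # False
```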

Some other notes

• Update the DAG id when changing the logic inside
• Use SLA alerts for critical tasks
• Features in plan:
    • Event Driven Scheduler
    • Mesos Scheduler
    • More operators
    • More syntactic sugar

Demo

Now you’ve learned:

• Definition and ecosystem.
• Challenges and key requirements.
• Solutions and general comparisons.
• The most important parts of Airflow and Luigi.
• Architecture, design, patterns, pitfalls and practices etc.