丁来强 [email protected]
Agenda
• Background
  • Definition
  • Role in Data Infra
• Requirement
  • Problem
  • Challenges
  • Requirement
• Solutions
  • Overview
  • Luigi
  • Airflow
• Demo
You will learn:
• The role of a workflow scheduler in the data engineering ecosystem.
• Challenges and key requirements.
• Solutions and their general differences.
• Architecture, design, and practices of using Airflow and Luigi in Python.
• Pitfalls and common patterns when designing around a workflow scheduler.
Definition
Big Data Workflow Scheduler
Schedules and manages the dependencies of workflows of jobs in a data infrastructure; mainly used in offline and near-line systems.
Big Data Workflow Scheduler
Workflow & Dependency
Jobs & Tasks
Big Data Systems
Scheduler
Different from the categories below:
• BPM
  • Like Activiti
• Middleware workflow & SOA
  • Like AWS Simple Workflow
• Pure data-driven pipeline / API for development
  • Like Apache Crunch, Apache Cascading, AWS Data Pipeline, Azure Data Factory
• Pure stream processing
  • Like Storm, Spark Streaming
Role in Data Infra
LinkedIn Data Infra
http://www.slideshare.net/amywtang/linkedin-bigdata-yaleoct2012final
Workflow scheduler usage figures in Big Data
• 14 boxes dedicated to the workflow system
• 8,000 tasks daily
• Maintains 3 instances of the workflow system
• 2,500 flows, 30,000 jobs daily
• 2,000+ tasks, 10,000+ Hadoop jobs daily
Airflow
What is most important for a Big Data workflow scheduler?
Dead simple:
- Easy to use and configure
Problems with big data job scheduling
Fragile process
Job A
Job B
Job C
Job X
Push to Production
Push to QA
A/B Testing
Alert on failure
Fragile Failure Handling
1. Scheduled triggers are skipped because a subsystem is unavailable
2. Jobs fail because a system or the network is temporarily unavailable
3. Errors or bugs may exist in some jobs' logic
4. Performance is slow, especially for some critical steps
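Temporary unavailability (item 2) is the classic case for automatic retries. The following is a minimal, library-free sketch of that idea; `run_with_retries`, `max_retries`, `retry_delay`, and `FlakyJob` are hypothetical names chosen for illustration, not the API of any scheduler discussed in these slides.

```python
import time


def run_with_retries(job, max_retries=3, retry_delay=0.0):
    """Run job() and retry when it raises, up to max_retries extra attempts.

    Returns (result, attempts). Re-raises the last error if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return job(), attempts
        except Exception:
            if attempts > max_retries:
                raise  # out of retries: surface the failure so it can be alerted on
            time.sleep(retry_delay)  # give the unavailable system time to recover


class FlakyJob:
    """Hypothetical job that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("network temporarily unavailable")
        return "done"
```

Retries only help with transient failures; a bug in the job's logic (item 3) fails on every attempt and still needs an alert.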
Requirement
Basic Needs
Failure Tolerance & Backfill
Calendar-Based Scheduling
Log Access & Monitoring
Notification
Workflow & Dependency
Job Definition
Advanced Needs
Scalability
SLA Monitor & Alert
Complex Rule
Operators OOB (out of the box)
High Availability
Programmatic
Advanced Needs (cont'd)
Queue (Affinity)
Pool (Limit concurrency + priority)
Data Profiling
Plugins
Versioning
Event-Driven Scheduler
Solution Overview
Basic Info     | Luigi   | Airflow          | Azkaban  | Oozie
Language       | Python  | Python           | Java     | Java
GitHub Stars   | 5,274   | 3,422            | 780      | 354
Contributors   | 256     | 178              | 37       | 18
Latest Version | 2.3.1   | 1.7.1            | 3.1      | 4.2
History        | 4 years | 1+ years         | 6+ years | 6+ years
Invented by    | Spotify | Airbnb           | LinkedIn | Yahoo
Owned by       | Spotify | Apache Incubator | Apache   | Apache
Azkaban
• Pros:
  • Born for Hadoop
    • Supports all Hadoop, Hive, and Pig versions
  • Easy-to-use web UI
    • Good job visualization and monitoring
  • Flexible module structure / plugins
• Cons:
  • Properties-file-based configuration
  • Web UI only; no CLI or REST interfaces (needs the 3rd-party Azkaban CLI)
  • Limited execution path control
Oozie
• Pros:
  • Born for Hadoop
  • CLI, HTTP, and Java API interfaces
  • Supports extended alert integration
• Cons:
  • Steeper learning curve
  • PDL-style, XML-based configuration
  • Limited web UI (needs Cloudera Hue)
  • No resource control
Luigi
Overview
• Pros:
  • Programmatic, in Python
  • Modeling is simple; code is mature (~20 KLOC)
  • Good Hadoop support (MR, logs, dist)
  • Test friendly; supports a local scheduler
• Cons:
  • Web UI is very limited
  • No built-in trigger (needs cron)
  • Not designed for large scale (>100K tasks)
  • No support for distributed execution
Task Definition
Output of the Task: return one or more Targets
Set up dependencies: return one or more Tasks
Logic
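The code on this slide did not survive in the transcript. As a stand-in, here is a library-free toy model of the three parts named above (output, dependencies, logic). It deliberately does not import the luigi package: the `Task` base class, `build` helper, `WORKDIR`, and the two example tasks are hypothetical, and outputs are plain file paths rather than Luigi Target objects.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a Luigi-style task; this is not the luigi library."""

    def requires(self):
        return []  # upstream Tasks this task depends on

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual logic

    def complete(self):
        # A task counts as done when its output already exists.
        return os.path.exists(self.output())


def build(task, executed):
    """Run upstream tasks first, skipping any task whose output exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, executed)
    task.run()
    executed.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()  # hypothetical scratch directory


class Extract(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("a\nb\nc\n")


class CountLines(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(WORKDIR, "count.txt")

    def run(self):
        with open(Extract().output()) as src, open(self.output(), "w") as dst:
            dst.write(str(len(src.readlines())))
```

Calling `build(CountLines(), [])` a second time does nothing, because both outputs already exist. Treating an existing output as proof of completion is what lets a scheduler of this style avoid running the same task twice.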
Architecture Notes
• Mainly manages dependencies and de-duplicates running tasks.
• Mainly focused on data pipeline ETL.
• Limitations
  • No calendar trigger
  • Web UI is very limited
  • Worker and scheduler are too tightly coupled (does not support >100K tasks)
  • Execution is bound to a specific worker
Task and Targets Library
• Google BigQuery
• Hadoop jobs
• Hive queries
• Pig queries
• Scalding jobs
• Spark jobs
• PostgreSQL, Redshift, MySQL tables
• and more…
Airflow
Overview
• Pros (we will see):
  • More general, flexible architecture
  • Very compelling web UI
  • Lots of cool features OOB; rich Operator library
  • Fast-growing adoption (30+ companies)
  • Test friendly (test mode and SequentialExecutor)
• Cons:
  • Code quality is not yet mature (unit-test coverage is not high)
  • No event-driven scheduler (same as all the other solutions)
Airflow Tech Stack
• Python code (<20 KLOC)
• DB: SQLAlchemy
• Celery for distributed execution
• Web server: Flask / gunicorn
• UI: d3.js / Highcharts / Pandas
• Templating: Jinja2
DAG (Directed Acyclic Graph)
DAG: a collection of tasks with scheduling settings
Task: an instance of BashOperator; supports templating
Set up the dependencies
A task of another kind: PythonOperator
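The DAG definition shown on this slide is not preserved in the transcript. As a library-free sketch of what declaring dependencies gives a scheduler, the standard-library `graphlib` module (Python 3.9+) can derive a valid run order and reject cycles. The task names and the `execution_order` helper are hypothetical, and this is not Airflow's API.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks that must finish first.
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task2"},
}


def execution_order(deps):
    """Return one valid run order, or raise ValueError if deps has a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"not a DAG: {err.args[1]}") from err
```

The "acyclic" requirement is what guarantees such an order exists; a cycle means no task in it could ever start.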
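DAG execution
[Diagram: Dag1 contains Task1, Task2, and Task3. It produces one DagRun per scheduled date: Dag1 Run (2016-9-1), Dag1 Run (2016-9-2), Dag1 Run (2016-9-3). Each DagRun holds task instances, for example Task1 Instance (2016-9-1) and Task3 Instance (2016-9-1). Operators shown: HiveOperator, PigOperator, PythonOperator. Hooks shown: HiveHook, PigHook.]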
Concepts – DAG, DAG Run
• DAG
  • A collection of Tasks
  • Settings for calendar scheduling
• DagRun
  • A run instance of a DAG with a scheduled date (ID: dag, start time, and interval)
Concepts – Operator, Task and TI
• Operator
  • Task templates
• Task
  • An instance of an Operator
• Task Instance (TI)
  • Belongs to a DagRun
  • A run instance of a Task with a scheduled date (ID: dag, task, start time, and interval)
Concepts - Operator
• Operator
  • Task templates, general categories:
    • Sensor
    • Branching
    • Transformer
  • Settings for trigger rules, retries, etc.
  • Uses a Hook for the real operations against external systems
Operator Library
• Google BigQuery, Cloud Storage
• AWS S3, EMR
• Spark SQL
• Docker
• Presto
• Sqoop
• Hive jobs
• Vertica
• Qubole
• SSH
• HipChat, Slack, Email
• PostgreSQL, Redshift, MySQL, Oracle, etc.
• and more…
Parameterized Tasks
• Variables
  • Global parameters
• Connections
  • An external system's connection string, confidential values, extra parameters, etc. Normally used by Hooks.
• DAG Parameters / Macros
• Templating
  • Uses Jinja for batch parameters or any other place that fits
• XCom
  • Shares data between Tasks
Airflow Architecture (Local Scheduler)
[Diagram: Scheduler, Workers, Web Servers, and a Metadata Database. External systems invoked: Hive, HDFS, MySQL, Cascading, Spark, Presto, …]
Local Scheduler – with version control
[Diagram: Master Repo and Code Repo feeding the Scheduler, Workers, and Web Servers, plus a Metadata Database. External systems: Hive, HDFS, MySQL, Cascading, Spark, Presto, …]
Airflow Architecture (Celery Scheduler)
[Diagram: Scheduler, Workers, Web Servers, Metadata Database, Brokers (MQ), and a State Store. External systems: Hive, HDFS, MySQL, Cascading, Spark, Presto, …]
Celery Scheduler – with version control
[Diagram: Master Repo and Code Repo feeding the Scheduler, Workers, and Web Servers, plus a Metadata Database, Brokers (MQ), and a State Store. External systems: Hive, HDFS, MySQL, Cascading, Spark, Presto, …]
Airflow Architecture - HA
[Diagram: Scheduler and Scheduler Slave, Workers, Web Servers, Metadata Database, Brokers (MQ), and a State Store. External systems: Hive, HDFS, MySQL, Cascading, Spark, Presto, …]
Scheduler – interval in workflow
Note: each DagRun starts only when the next DagRun's execution time is reached.
Run at 04:00:00
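A small arithmetic sketch of this rule, assuming a 4-hour interval (an assumption consistent with the 04:00:00 note). `earliest_start` is a hypothetical helper written for illustration, not an Airflow function.

```python
from datetime import datetime, timedelta


def earliest_start(execution_date, schedule_interval):
    """Wall-clock time at which the run labelled execution_date may start.

    The run covers [execution_date, execution_date + schedule_interval), so it
    is triggered once that period has ended.
    """
    return execution_date + schedule_interval


interval = timedelta(hours=4)  # assumed interval, matching the 04:00 note
run1 = datetime(2016, 9, 1, 0, 0)  # run labelled 2016-09-01 00:00
start_of_run1 = earliest_start(run1, interval)
```

Here the run labelled 00:00 actually starts at 04:00, which is the label of the following run.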
Scheduler – recursive running
Run 1 (09-01 00:00)
Run 2 (09-01 04:00)
What if output_file in Run 1 and Run 2 affect each other?
Idea: do your best to avoid this kind of design.
Principles when defining a Task
Atomic: each Task should be atomic
- Isolated from concurrent processing
- Either succeeds or fails; no grey state
- Failure will not impact the system
Granularity: Task granularity should be appropriate
- Choose the "right size" for one task
- Tasks should be able to execute simultaneously
That is the ideal case; in real cases…
Scheduler – recursive dependency
Run 1 (09-01 00:00)
Run 2 (09-01 04:00)
What if read_file in Run 1 and Run 2 cannot run in parallel because of an external system's limitation?
Option 1: assign a pool with 1 slot to task read_file
Option 2: turn on the option "depends_on_past" for task read_file
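To illustrate why a one-slot pool serializes read_file across runs, here is a toy model built on a standard-library semaphore. The `Pool` class and `run_concurrently` helper are hypothetical and only imitate the idea; they are not how Airflow implements pools.

```python
import threading
import time


class Pool:
    """Toy model of a scheduler pool: at most `slots` tasks run at once."""

    def __init__(self, slots):
        self._slots = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self._running = 0
        self.max_running = 0  # highest concurrency observed so far

    def run(self, task):
        with self._slots:  # blocks while every slot is taken
            with self._lock:
                self._running += 1
                self.max_running = max(self.max_running, self._running)
            try:
                task()
            finally:
                with self._lock:
                    self._running -= 1


def run_concurrently(pool, tasks):
    """Submit every task at once from separate threads and wait for all."""
    threads = [threading.Thread(target=pool.run, args=(t,)) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A pool limits concurrency but does not order the runs, whereas "depends_on_past" makes each run wait for the previous run of the same task to succeed.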
Scheduler – more recursive dependency
Run 1 (09-01 00:00)
Run 2 (09-01 04:00)
What if read_file in Run 2 relies on output_file in Run 1, because of a restriction or a necessarily stateful design?
Option: turn on the option "wait_for_downstream" for task read_file (this forces "depends_on_past" on)
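The two options can be read as rules over the previous run's task states. The sketch below is a toy rule checker under that reading; `can_run`, the `states` dictionary, and the run indices are hypothetical constructs for illustration, not Airflow internals.

```python
def can_run(task_id, run_index, states, downstream=(),
            depends_on_past=False, wait_for_downstream=False):
    """Decide whether task_id may start in run number run_index.

    states maps (task_id, run_index) to "success", "failed", or is missing
    when that instance has not finished. downstream lists the tasks that come
    immediately after task_id.
    """
    if wait_for_downstream:
        depends_on_past = True  # the stricter option implies the weaker one
    if not depends_on_past or run_index == 0:
        return True  # no constraint, or there is no previous run to wait for
    previous = run_index - 1
    required = [task_id]
    if wait_for_downstream:
        required += list(downstream)
    return all(states.get((t, previous)) == "success" for t in required)
```

With only "depends_on_past", read_file in Run 2 may start as soon as read_file in Run 1 succeeded; with "wait_for_downstream" it also waits for output_file in Run 1.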
Scheduler – recursive dependency pitfall
start_date and schedule_interval should be aligned
- 2016-09-08 00:00:00 is aligned
- 2016-09-08 02:00:00 is NOT aligned
Misalignment makes the DAG fail if the option "depends_on_past" is turned on
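One way to catch this before deploying is a quick alignment check. The sketch assumes a 4-hour interval whose ticks are counted from midnight (consistent with the 00:00 and 04:00 runs on the earlier slides); `is_aligned` is a hypothetical helper, not part of Airflow, and it is only meaningful for intervals that divide a day evenly.

```python
from datetime import datetime, timedelta


def is_aligned(start_date, schedule_interval):
    """True if start_date falls exactly on a schedule tick counted from midnight."""
    midnight = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return (start_date - midnight) % schedule_interval == timedelta(0)


interval = timedelta(hours=4)  # assumed interval for this example
aligned = is_aligned(datetime(2016, 9, 8, 0, 0, 0), interval)
not_aligned = is_aligned(datetime(2016, 9, 8, 2, 0, 0), interval)
```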
Note: '@once' runs just one time
Some other notes
• Update the DAG ID when changing the logic inside
• Use SLA alerts for critical tasks
• Features planned:
  • Event-driven scheduler
  • Mesos scheduler
  • More operators
  • More syntactic sugar
Demo
Now you've learned:
• Definition and ecosystem.
• Challenges and key requirements.
• Solutions and general comparisons.
• The most important parts of Airflow and Luigi.
• Architecture, design, patterns, pitfalls, practices, etc.