The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark



The Fifth Elephant 2016
Akshay Rai
Engineer, Hadoop Development Team, LinkedIn

Dr. Elephant

2016 LinkedIn Corporation. All Rights Reserved.

Hadoop @ LinkedIn c. 2008
  • 1 cluster
  • 20 nodes
  • 10 users
  • 10 workflows in production
  • MapReduce, Pig

Hadoop @ LinkedIn c. 2016
  • > 10 clusters
  • > 10000 nodes
  • > 1000 users
  • Thousands of queries and flows in development
  • Hundreds running in production
  • MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure
  • Add extra machines to the cluster
  • Hadoop is scalable, but not that optimal!
  • We cannot keep adding machines forever
  • Tune the given resources and minimize the addition of new machines

Measuring performance
  • Highlights hardware failures and poorly performing components
  • Scope for environment upgrades


Cluster Level Performance Tuning
Job Level Performance Tuning


How difficult is it to tune a job?
  • Production Gatekeeper: let jobs go into production only after verifying that they are tuned.
  • Restriction! More questions on how to tune! More resources spent helping people.

Here's what we tried in order to achieve job tuning!

Challenges in tuning a job
  • Hadoop is designed to let users tune their jobs, BUT!
  • One cannot optimize without understanding the internals of the framework
  • Critical information is scattered
  • Hadoop has a huge set of parameters, and tuning some may impact others

You cannot tune what you do not know & you cannot improve what you cannot measure


Training Sessions


  • More people, more frequent sessions
  • Hadoop experience varies from person to person
  • Framework-specific training: Pig, Hive, etc.

Training - Doesn't Scale

Expert Review


Expert Review - Also Doesn't Work
  • Again, not scalable
  • Cannot ensure a job is performing optimally; no easy comparison
  • Different people, different perspectives, no consensus
  • Error prone; one might overlook certain aspects

Scaling Hadoop Infrastructure is HARD
Scaling User Productivity is much HARDER


Birth of Dr. Elephant


What does Dr. Elephant do?
  • Helps every user get the best performance from their jobs
  • Analyses and compares historical executions
  • Provides a platform for other performance-related tools



Rule #1: Mapper Data Skew

Mapper Skew Problem
  • Varying split sizes can cause skew in the mapper input
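
One way to put a number on mapper input skew is sketched below. It is only an illustration under stated assumptions, not necessarily how Dr. Elephant computes it: the per-mapper input sizes are assumed to be available as a plain list, and the function name `mapper_skew_ratio` is hypothetical. The sketch sorts the sizes, splits them into a light half and a heavy half, and compares the two averages.

```python
from statistics import mean


def mapper_skew_ratio(input_sizes):
    """Return the average input of the heavy half of mappers divided by
    the average input of the light half (hypothetical helper).

    A value close to 1.0 means the mappers read similar amounts of data;
    a large value means a few mappers do most of the work.
    """
    if len(input_sizes) < 2:
        return 1.0  # a single mapper cannot be skewed against itself
    ordered = sorted(input_sizes)
    mid = len(ordered) // 2
    light_avg = mean(ordered[:mid])
    heavy_avg = mean(ordered[mid:])
    if light_avg == 0:
        # Avoid dividing by zero when the light half read no data at all.
        return float("inf") if heavy_avg > 0 else 1.0
    return heavy_avg / light_avg


# Illustrative per-mapper input sizes in MB (made-up numbers).
sizes_mb = [128, 130, 126, 8, 6, 7, 5, 129]
ratio = mapper_skew_ratio(sizes_mb)
print(f"heavy/light average input ratio: {ratio:.1f}")
```

A ratio far above 1 suggests that the splits feeding the mappers differ widely in size.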


Solution to Mapper Skewness
  • Each mapper should process the same amount of data
  • Combine the small chunks and feed them to a single mapper
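
The idea of combining small chunks can be illustrated with a greedy packing sketch. This is an assumption-laden toy, not the logic of any Hadoop input format: `combine_chunks` and `target_size` are hypothetical names, and chunk sizes are plain integers. It packs chunks, largest first, into groups whose total stays at or below the target, so each group could be handed to one mapper.

```python
def combine_chunks(chunk_sizes, target_size):
    """Greedily pack chunks into groups of at most target_size
    (first-fit, largest chunks first). A chunk larger than target_size
    ends up alone in its own group.
    """
    groups = []  # each entry is [running_total, [chunk sizes]]
    for size in sorted(chunk_sizes, reverse=True):
        for group in groups:
            if group[0] + size <= target_size:
                group[0] += size
                group[1].append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size, [size]])
    return [chunks for _, chunks in groups]


# Illustrative chunk sizes in MB (made-up numbers) and a 128 MB target.
chunks_mb = [64, 64, 32, 32, 32, 32, 16, 16, 16, 16, 64, 128]
for group in combine_chunks(chunks_mb, target_size=128):
    print(sum(group), group)
```

In practice this kind of combining is normally left to the framework (for example, a combining input format) rather than written by hand; the sketch only shows why the resulting mapper inputs come out more even.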


Rule #2: Mapper Memory

Mapper Memory Problem & Solution
  • Requested container memory >> memory consumed by the tasks
  • The job requests a 4 GB container
  • The job actually uses only 512 MB
  • It waits longer to get 4 GB and then blocks 4 GB of resources!
  • Request a lower container memory by setting mapreduce.(map|reduce).memory.mb
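
The arithmetic behind this rule can be sketched as follows. The helper name `suggest_container_mb`, the 1.5x headroom, and the 512 MB rounding step are all assumptions chosen for illustration; they are not values taken from the talk or from Dr. Elephant.

```python
import math


def suggest_container_mb(requested_mb, peak_used_mb, headroom=1.5, step_mb=512):
    """Suggest a container size: peak usage times a safety headroom,
    rounded up to a multiple of step_mb, and never above what was
    originally requested (hypothetical helper).
    """
    needed = peak_used_mb * headroom
    rounded = math.ceil(needed / step_mb) * step_mb
    return min(rounded, requested_mb)


requested_mb = 4096  # the 4 GB container from the slide
peak_used_mb = 512   # what the tasks actually consumed

over_request = requested_mb / peak_used_mb
suggested = suggest_container_mb(requested_mb, peak_used_mb)
print(f"requested/used ratio: {over_request:.1f}")
print(f"candidate value for mapreduce.map.memory.mb: {suggested}")
```

The suggested value is only a starting point; the JVM heap setting for the task usually has to be lowered along with the container size.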




MapReduce Report


Job History


How to define a rule?

How does a Rule work?
  • INPUT: Counters & task data
  • LOGIC: Some logic to compute a value
  • OUTPUT: Compare the value against threshold levels
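
The input, logic, output flow of a rule can be sketched in a few lines. Dr. Elephant's real rules are not written this way; everything here is a hypothetical illustration, including the `Severity` names, the `memory_rule` function, the shape of `task_data`, and the threshold values.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels assumed for this sketch, from best to worst."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_for(value, thresholds):
    """OUTPUT step: compare a value against ascending threshold levels
    for LOW, MODERATE, SEVERE and CRITICAL, and return the highest level
    whose threshold the value reaches.
    """
    level = Severity.NONE
    for limit, severity in zip(thresholds, list(Severity)[1:]):
        if value >= limit:
            level = severity
    return level


def memory_rule(task_data, thresholds=(2, 4, 6, 8)):
    """INPUT: task data. LOGIC: ratio of requested to peak used memory.
    OUTPUT: the ratio and its severity against the thresholds.
    """
    ratio = task_data["requested_mb"] / max(task_data["used_mb"])
    return ratio, severity_for(ratio, thresholds)


# Illustrative task data (made-up numbers).
ratio, severity = memory_rule({"requested_mb": 4096, "used_mb": [500, 512, 480]})
print(f"ratio={ratio:.1f} severity={severity.name}")
```

Keeping the thresholds as a parameter mirrors the point that threshold levels are meant to be adjustable per deployment.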


Customising Dr. Elephant


Adding a Custom Rule
  • Create a new rule and test it.
  • Create a help page defining the rule, the parameters to tune, etc.
  • Add the details of the rule to the HeuristicConf.xml file, e.g. application type (Mapreduce) and rule name.
  • Run Dr. Elephant. It should now include the new rule.

What else can you customize?
  • Rules, set threshold levels
  • Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc.)
  • Enable/disable and extend to new Fetchers
  • Extend to newer application types and job types

Production Gatekeeper


Automated Production Reviews | JIRA Bot
  • Cluster for critical workloads
  • Audit before deployment

Workflow monitoring and reports
  • Monitor performance on each execution
  • Compare behaviour across revisions
  • Cost to Serve analysis


Open Source, April 2016: linkedin/dr-elephant

Watchers: 60, Stars: 262, Forks: 109

Let's collectively contribute!


Pull Requests: 60+, Contributors: 10+, User Topics: 50+

Dr. Elephant Community


Coming Soon
  • Real-time analysis of jobs
  • Analytics for failed jobs
  • Visualizing workflows through DAGs
  • Support for other schedulers and frameworks

References
  • Engineering Blog
  • Source: GitHub
  • List & Gitter: dr-elephant-users, linkedin/dr-elephant
  • Hadoop Summit 2015 (Mark Wagner)

Thank You

Akshay Rai

