The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark

Self-Serve Performance Tuning for Hadoop & Spark

The Fifth Elephant 2016
Akshay Rai
Engineer, Hadoop Development Team
LinkedIn

Dr. Elephant

© 2016 LinkedIn Corporation. All Rights Reserved.

Hadoop @ LinkedIn c. 2008
  • 1 cluster
  • 20 nodes
  • 10 users
  • 10 workflows in production
  • MapReduce, Pig

Hadoop @ LinkedIn c. 2016
  • > 10 clusters
  • > 10000 nodes
  • > 1000 users
  • Thousands of queries and flows in development
  • Hundreds running in production
  • MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure
  • Add extra machines to the cluster
  • Hadoop is scalable but not that optimal!
  • We cannot keep adding machines forever
  • Tune given resources and minimize addition of new machines

Measuring performance
  • Highlights hardware failures and poorly performing components
  • Scope for environment upgrades

Cluster Level Performance Tuning
Job Level Performance Tuning

How difficult is it to tune a Job?
  • Production Gatekeeper: let jobs go into production only after verifying they are tuned.
  • Restriction! More questions on how to tune! Spend more resources helping people.

Here's what we tried to achieve job tuning!

Challenges in tuning a job
  • Hadoop is designed to let users tune their jobs, BUT!
  • One cannot optimize if one doesn't understand the internals of the framework
  • Critical information is scattered
  • Hadoop has a huge set of parameters; tuning some may impact others

You cannot tune what you do not know & you cannot improve what you cannot measure

Training Sessions

Training - Doesn't Scale
  • More people, more frequent sessions.
  • Hadoop experience varies with people
  • Framework specific training: Pig, Hive, etc.

Expert Review

Expert Review - Also Doesn't Work
  • Again, not scalable
  • Cannot ensure job is performing optimally, no easy comparison.
  • Different people, different perspectives, no consensus
  • Error prone, one might overlook certain aspects.

Scaling Hadoop Infrastructure is HARD
Scaling User Productivity is much HARDER

Birth of Dr. Elephant

What does Dr. Elephant do?
  • Helps every user get the best performance from their jobs
  • Analyses and compares historical executions
  • Provides a platform for other performance related tools

Architecture

Rule #1: Mapper Data Skew

Mapper Skew Problem
  • Varying size of splits can cause skewness in the Mapper Input

Solution to Mapper Skewness
  • Each Mapper should process the same amount of data
  • Combine the small chunks and feed them to a single Mapper
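The idea of combining small chunks can be illustrated with a minimal Python sketch. It is not how Hadoop or Dr. Elephant implements split combination; the function name `combine_chunks`, the chunk sizes, and the 512 MB target are assumptions chosen for illustration. It greedily packs consecutive chunks into one mapper's input until the target size would be exceeded.

```python
from typing import List


def combine_chunks(chunk_sizes_mb: List[int], target_split_mb: int) -> List[List[int]]:
    """Greedily pack consecutive chunks into splits of at most target_split_mb.

    A single chunk larger than the target still gets a split of its own.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in chunk_sizes_mb:
        # Close the current split when adding this chunk would overflow the target.
        if current and current_size + size > target_split_mb:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: a few large chunks mixed with many tiny ones (sizes in MB).
chunks = [512, 8, 16, 400, 32, 500, 4]
splits = combine_chunks(chunks, target_split_mb=512)
per_mapper_input = [sum(s) for s in splits]
print(per_mapper_input)
```

With these example sizes, seven uneven chunks become three mapper inputs of roughly equal size. In Hadoop itself, input formats such as `CombineFileInputFormat` serve a similar purpose.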

Rule #2: Mapper Memory

Mapper Memory Problem & Solution
  • Requested Container Memory >> Task's Consumed Memory
  • Request a 4 GB container
  • The job actually uses only 512 MB
  • Wait longer to get 4 GB and then block 4 GB of resources!
  • Request a lower container memory by setting mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
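The arithmetic behind this rule can be sketched in a few lines of Python. The 4 GB and 512 MB figures come from the slide; the helper `suggest_container_mb`, the 1.5x headroom, and the 256 MB rounding step are hypothetical choices, not values Dr. Elephant prescribes.

```python
import math


def suggest_container_mb(peak_used_mb: float, headroom: float = 1.5, step_mb: int = 256) -> int:
    """Scale peak usage by a safety headroom and round up to a multiple of step_mb."""
    return int(math.ceil(peak_used_mb * headroom / step_mb) * step_mb)


requested_mb = 4096   # container size requested by the job (4 GB)
peak_used_mb = 512    # memory the tasks actually consumed

# Fraction of the requested container that was really used.
utilisation = peak_used_mb / requested_mb

# Candidate value for mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb).
suggested_mb = suggest_container_mb(peak_used_mb)

print(f"utilisation={utilisation:.1%}, suggested container={suggested_mb} MB")
```

Under these assumptions the job uses 12.5% of what it requested, and a 768 MB container would cover it with room to spare. Any real change should also keep the task's JVM heap setting within the new container size.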

Search

MapReduce Report

Job History

How to define a rule?

How does a Rule work?
  • INPUT: Counters & Task Data
  • LOGIC: Some logic to compute a value
  • OUTPUT: Compare value against threshold levels
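The input, logic, and output stages can be sketched as a small Python function. This is an illustration only, not Dr. Elephant's actual Java heuristic interface: the rule `memory_waste_rule`, the severity names, and the threshold levels `[2, 4, 6, 8]` are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import mean
from typing import List, Sequence

# Illustrative severity scale, ordered from best to worst.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def severity(value: float, thresholds: Sequence[float]) -> str:
    """OUTPUT: map a value onto a severity using ascending threshold levels."""
    return SEVERITIES[bisect_right(thresholds, value)]


def memory_waste_rule(requested_mb: float, task_used_mb: List[float]) -> str:
    # INPUT: requested container size plus per-task memory taken from counters/task data.
    # LOGIC: how many times larger the request is than the average usage.
    ratio = requested_mb / mean(task_used_mb)
    # OUTPUT: compare the computed value against the (hypothetical) threshold levels.
    return severity(ratio, thresholds=[2, 4, 6, 8])


print(memory_waste_rule(4096, [512, 500, 524]))  # request is 8x the average usage
print(memory_waste_rule(1024, [900, 1000]))      # request is close to the usage
```

Keeping the thresholds as a plain parameter mirrors the point that threshold levels are meant to be configurable per rule.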

Customising Dr. Elephant

Adding a Custom Rule
  • Create a new Rule and test it.
  • Create a help page defining the rule, parameters to tune, etc.
  • Add the details of the Rule in the HeuristicConf.xml file, with the values: Mapreduce, Rule Name, path.to.rule.class, path.to.rule.help.page
  • Run Dr. Elephant. It should now include the new rules.
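The slide's XML markup did not survive extraction, so the entry below is a reconstruction under assumptions: the element names (`heuristics`, `heuristic`, `applicationtype`, `heuristicname`, `classname`, `viewname`) are a guess at the layout and should be checked against the HeuristicConf.xml shipped in the repository. Only the four placeholder values come from the slide. The Python sketch simply parses such an entry to show which pieces of information a rule registration carries.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed layout of one rule entry; tag names are hypothetical, values are the slide's placeholders.
HEURISTIC_CONF = """
<heuristics>
  <heuristic>
    <applicationtype>Mapreduce</applicationtype>
    <heuristicname>Rule Name</heuristicname>
    <classname>path.to.rule.class</classname>
    <viewname>path.to.rule.help.page</viewname>
  </heuristic>
</heuristics>
"""


def load_heuristics(xml_text: str) -> List[Dict[str, str]]:
    """Return one dict per <heuristic> element, keyed by child tag name."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: (child.text or "").strip() for child in heuristic}
        for heuristic in root.findall("heuristic")
    ]


rules = load_heuristics(HEURISTIC_CONF)
for rule in rules:
    print(rule)
```

Each entry ties together the application type the rule applies to, a display name, the class implementing the rule, and the help page describing it.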

What else can you customize?
  • Rules, set threshold levels
  • Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc.)
  • Enable/disable and extend to new Fetchers
  • Extend to newer application types and job types

Production Gatekeeper

Automated Production Reviews | JIRA Bot
  • Cluster for critical workloads
  • Audit before deployment

Workflow monitoring and reports
  • Monitor performance on each execution
  • Compare behaviour across revisions
  • Cost to Serve analysis
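Comparing behaviour across revisions could look roughly like the Python sketch below. The metric (container memory multiplied by task runtime), the 20% tolerance, and the sample history are all assumptions for illustration; the talk does not specify how Dr. Elephant computes or compares these figures.

```python
from typing import Dict, List, Tuple

Task = Tuple[int, float]  # (container_mb, runtime_seconds)


def resource_mb_seconds(tasks: List[Task]) -> float:
    """Total resources held by one execution: sum of container_mb * runtime_seconds."""
    return sum(mb * secs for mb, secs in tasks)


def regressions(history: Dict[str, List[Task]], tolerance: float = 0.2) -> List[str]:
    """Flag executions whose usage grew by more than `tolerance` over the previous one."""
    flagged: List[str] = []
    previous = None
    for execution, tasks in history.items():  # dicts keep insertion order
        usage = resource_mb_seconds(tasks)
        if previous is not None and usage > previous * (1 + tolerance):
            flagged.append(execution)
        previous = usage
    return flagged


# Hypothetical history of one workflow across three revisions.
history = {
    "rev-1": [(2048, 100.0), (2048, 120.0)],
    "rev-2": [(2048, 110.0), (2048, 115.0)],
    "rev-3": [(4096, 110.0), (4096, 115.0)],  # container size doubled
}
print(regressions(history))
```

In this example only the third revision is flagged, because doubling the container size doubles the resources held while the runtime stays the same.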

Open Source, April 2016

github.com/linkedin/dr-elephant

Watchers: 60, Stars: 262, Forks: 109

Let's collectively contribute!

Dr. Elephant Community
  • Pull Requests: 60+
  • Contributors: 10+
  • User Topics: 50+

Coming Soon
  • Real time analysis of Jobs
  • Analytics for Failed Jobs
  • Visualizing Workflows through DAGs
  • Support for other schedulers and frameworks

References
  • Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
  • Open Source GitHub Link: github.com/linkedin/dr-elephant
  • Mailing List & Gitter: dr-elephant-users, linkedin/dr-elephant
  • Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)

Thank You

Akshay Rai
https://in.linkedin.com/in/akshayrai09
