
A Bird’s-Eye View of Pig and Scalding with hRaven

a tale by @gario and @joep

Hadoop Summit 2013

v1.2

@Twitter #HadoopSummit2013

About the authors

• Apache HBase PMC member and Committer
• Software Engineer @ Twitter
• Core Storage Team - Hadoop/HBase

• Software Engineer @ Twitter
• Engineering Manager Hadoop/HBase team @ Twitter


Table of Contents

• Chapter 1: The Problem
• Chapter 2: Why hRaven?
• Chapter 3: How Does it Work?
  • 3a: Loading
  • 3b: Table structure / querying
• Chapter 4: Current Uses
• Appendix: Future Work

Chapter 1: The Problem

Illustration by Sirxlem (CC BY-NC-ND 3.0)


Chapter 1: Mismatched Abstractions

• Most users run Pig and Scalding scripts, not straight map reduce
• JobTracker UI shows jobs, not DAGs of jobs generated by Pig and Scalding


Chapter 1: A Problem of Scale



Chapter 1: Questions

• How many Pig versus Scalding jobs do we run?
• What cluster capacity do jobs in my pool take?
• How many jobs do we run each day?
• What % of jobs have > 30k tasks?
• Why do I need to hand-tune these (hundreds of) jobs? Can’t the cluster learn?



#Nevermore

Chapter 2: Why hRaven?

Photo by DAVID ILIFF. License: CC-BY-SA 3.0


Chapter 2: Why hRaven?

• Stores stats, configuration and timing for every map reduce job on every cluster
• Structured around the full DAG of jobs from a Pig or Scalding application
• Easily queryable for historical trending
• Allows for Pig reducer optimization based on historical run stats
• Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)


Chapter 2: Key Concepts

• cluster - each cluster has a unique name mapping to the Job Tracker
• user - map reduce jobs are run as a given user
• application - a Pig or Scalding script (or plain map reduce job)
• flow - the combined DAG of jobs executed from a single run of an application
• version - changes impacting the DAG are recorded as a new version of the same application


Chapter 2: Application Flows

[Figure: application flows for user “Edgar”]


Chapter 2: Flow Storage

• All jobs in a flow are ordered together
• Most recent flow is ordered first


Chapter 2: Key Features

• All jobs in a flow are ordered together
• Per-job metrics stored
  • Total map and reduce tasks
  • HDFS bytes read / written
  • File bytes read / written
  • Total map and reduce slot milliseconds
• Easy to aggregate stats for an entire flow
• Easy to scan the timeseries of each application’s flows
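Because every job in a flow carries the same set of metrics, a flow-level total can be computed as a plain sum over the flow's jobs. The following is a minimal sketch of that idea, not hRaven code: the field names (`total_maps`, `hdfs_bytes_read`, and so on) and the sample numbers are hypothetical stand-ins for the stored per-job metrics.

```python
from collections import Counter

# Hypothetical per-job metric records for a single flow. The field names and
# values are illustrative only and are not taken from hRaven's schema.
flow_jobs = [
    {"total_maps": 120, "total_reduces": 10, "hdfs_bytes_read": 5000, "hdfs_bytes_written": 800},
    {"total_maps": 40, "total_reduces": 4, "hdfs_bytes_read": 800, "hdfs_bytes_written": 200},
]


def aggregate_flow(jobs):
    """Sum every numeric metric across the jobs of one flow."""
    totals = Counter()
    for job in jobs:
        totals.update(job)  # Counter.update adds values key by key
    return dict(totals)


flow_totals = aggregate_flow(flow_jobs)
```

Since jobs of one flow are stored next to each other, the list passed to `aggregate_flow` would correspond to one contiguous range of rows.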

Chapter 3: How Does it Work?


Chapter 3: ETL - Step 1: JobFilePreprocessor


Chapter 3: ETL - Step 2: JobFileRawLoader


Chapter 3: ETL - Step 3: JobFileProcessor



Jobs finish out of order with respect to job_id


Chapter 3: Tables

• job_history_raw
• job_history
• job_history_task
• job_history_app_version


Chapter 3: job_history_raw

Row key: cluster!jobID

Columns:
• jobconf - stores serialized raw job_*_conf.xml file
• jobhistory - stores serialized raw job history log file
• job_processed_success - indicates whether job has been processed


Chapter 3: job_history

Row key: cluster!user!application!timestamp!jobID
• cluster - unique cluster name (e.g. “cluster1@dc1”)
• user - user running the application (“edgar”)
• application - application ID derived from job configuration:
  • uses “batch.desc” property if set
  • otherwise parses a consistent ID from “mapred.job.name”
• timestamp - inverted (Long.MAX_VALUE - value) value of submission time
• jobID - stored as Job Tracker start time (long), concatenated with job sequence number
  • job_201306271100_0001 -> [1372352073732L][1L]
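The inverted timestamp is what makes the most recent run sort first in a table ordered by row key bytes. The sketch below illustrates the idea under stated assumptions: the helper `job_history_key`, the `!` byte separator, the 8-byte big-endian encoding, and the sample submission times are choices made for this example and are not taken from the hRaven source.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE
SEP = b"!"


def job_history_key(cluster, user, app, submit_ms, jt_start_ms, job_seq):
    """Build a job_history-style row key (illustrative layout only)."""
    # Newer submission times give smaller inverted values, so they sort first.
    inverted = struct.pack(">q", LONG_MAX - submit_ms)
    # Job ID as two fixed-width longs: Job Tracker start time + sequence number.
    job_id = struct.pack(">qq", jt_start_ms, job_seq)
    return SEP.join([cluster.encode(), user.encode(), app.encode(), inverted, job_id])


# Two runs of the same application; the submission times are made-up values.
older = job_history_key("cluster1@dc1", "edgar", "wordcount", 1372352100000, 1372352073732, 1)
newer = job_history_key("cluster1@dc1", "edgar", "wordcount", 1372352200000, 1372352073732, 2)

# Sorting the raw bytes mimics the ordering of rows in the table.
newest_first = sorted([older, newer])
```

With this layout, a scan that starts at the `cluster!user!application!` prefix would encounter the latest run before any older one.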


Chapter 3: job_history_task

Row key: cluster!user!application!timestamp!jobID!taskID
• same components as job_history key (same ordering)
• taskID - (e.g. “m_00001”) uniquely identifies individual task/attempt in job

Two row types:
• Task - “meta” row
  cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
• Task Attempt - individual execution on a Task Tracker
  cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
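One consequence of appending the attempt number to the task ID is that, in sorted key order, each task's “meta” row is immediately followed by its attempt rows. A small sketch of that ordering, using readable placeholder strings (`<ts>` and `<jobID>` are stand-ins for the binary key components, and `attempts_of` is a hypothetical helper):

```python
# Hypothetical, human-readable stand-ins for job_history_task row keys.
job_prefix = "cluster1@dc1!edgar!wordcount!<ts>!<jobID>!"

rows = [
    job_prefix + "m_00002",
    job_prefix + "m_00001_1",
    job_prefix + "m_00001",
    job_prefix + "m_00001_0",
    job_prefix + "m_00002_0",
]

# Lexical sort, as a key-ordered store would keep the rows.
ordered = sorted(rows)


def attempts_of(task_id, keys):
    """Return the attempt rows belonging to one task."""
    prefix = job_prefix + task_id + "_"
    return [k for k in keys if k.startswith(prefix)]


first_task_attempts = attempts_of("m_00001", ordered)
```

Here `ordered` lists `m_00001`, then its attempts `m_00001_0` and `m_00001_1`, and only then `m_00002`.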


Chapter 3: job_history_app_version

Row key: cluster!user!application

Example: cluster1@dc1!edgar!wordcount

Columns:
v1=1369585634000
v2=1372263813000


Chapter 3: Querying hRaven

• Using Pig’s HBaseStorage (or direct HBase APIs)
• Through Client API
• Through REST API

Chapter 4: Current Uses


Chapter 4: Current Uses

• Pig reducer optimizations
• Cluster utilization / capacity planning
• Application performance trending over time
• Identifying common job anti-patterns
• Ad-hoc analysis for troubleshooting cluster problems


Chapter 4: Cluster reads-writes


Chapter 4: Pool / Application reads/writes

• Pool view
  • Spike in File size read
  • Indicates jobs spilling

• Application view
  • Spike in HDFS size read
  • Indicates spiking input


Chapter 4: Pool usage: Used vs. Allocated



Chapter 4: Compute cost

Appendix: Future Work


Appendix: Future Work

• Real-time data loading from Job Tracker / Application Master
• Full flow-centric UI (Job Tracker UI replacement)
• Hadoop 2.0 compatibility (in-progress)
• Ambrose integration


Additional Resources

hRaven on GitHub
• https://github.com/twitter/hraven

hRaven Mailing Lists
• [email protected]
• [email protected]


Afterword

Now will thou drop your job data on the floor?
Quoth the hRaven, 'Nevermore.'

#TheEnd

@gario and @joep

Come visit us at booth #26 to continue the story


Sort order: variable-length job_id

• Desired order
  job_201306271100_9999
  job_201306271100_10000
  ...
  job_201306271100_99999
  job_201306271100_100000
  ...
  job_201306271100_999999
  job_201306271100_1000000

• Lexical order
  job_201306271100_10000
  job_201306271100_100000
  job_201306271100_1000000
  job_201306271100_9999
  job_201306271100_99999
  job_201306271100_999999
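The mismatch between the two orderings can be reproduced in a few lines, along with one way to get the desired order back: encode the numeric parts of the job ID as fixed-width big-endian longs. This is a sketch under assumptions, not the hRaven implementation. In particular, `encode_job_id` is a hypothetical helper that packs the digits of the Job Tracker identifier directly as a number, whereas the job_history slide shows the start time as epoch milliseconds.

```python
import struct

job_ids = [
    "job_201306271100_10000",
    "job_201306271100_9999",
    "job_201306271100_100000",
    "job_201306271100_99999",
]


def encode_job_id(job_id):
    """Encode both numeric parts of a job ID as fixed-width big-endian longs.

    Assumption for this sketch: the Job Tracker identifier digits are packed
    as a plain number rather than converted to epoch milliseconds.
    """
    _, start, seq = job_id.split("_")
    return struct.pack(">qq", int(start), int(seq))


# Plain string sort: "10000" lands before "9999" because '1' < '9'.
lexical = sorted(job_ids)

# Sorting by the fixed-width byte encoding restores numeric order.
numeric = sorted(job_ids, key=encode_job_id)
```

Because every encoded ID has the same length, byte-wise comparison agrees with numeric comparison for non-negative values, which is the property a key-ordered store needs.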