Watching Pigs Fly with the Netflix Hadoop Toolkit


DESCRIPTION

Frameworks and technologies in the Hadoop ecosystem are undergoing rapid innovation, but the open source tooling around usability has lagged behind. We will present a suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Pig workflows and easily interact with and visualize datasets. Netflix’s big data teams have worked for the past year implementing this framework in the AWS cloud. During that time, we have seen a massive influx of data and a corresponding increase in new development on our platform. This toolset has been a critical enabler in minimizing development time and effort. Using the development of a recommendation algorithm as an example, we’ll walk through use cases for this stack of tools, showing how they interact to facilitate development. The presentation will include demos, implementation details, and our roadmap to open source various key services in the framework, including RESTful services that: provide comprehensive metadata management across data sources; enable visualization and caching of results of Hadoop jobs; visualize the execution plans produced by languages such as Pig and Hive; and provide detailed analytics on the currently executing workload and trends in historical performance.


Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Summit 2013, San Jose, CA

Data should be accessible, easy to discover, and easy to process for everyone.

Our Motivation

Our Users

Analysts, Engineers

Hadoop Platform as a Service

Hadoop Platform as a Service

S3

Hadoop Platform as a Service

Data Platform

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad hoc Visualization)

Forklift (Data Movement)

Looper (Backloading)

Ignite (A/B Test Analytics)

Spock (Data Auditing)

Genie (Hadoop PaaS)

Lipstick (Pig Workflow Visualization)

Event Service (Orchestration)

Hadoop

S3

Other Processing

Let’s solve a problem using the data!

Build a recommender.

But what makes good recommendations?

Similarity

Personalization

COLORS!

COLORS! Box art is colorful…

We’re Sorry

COLORS! Box art is colorful…

Where can I find the data?

Hadoop Platform as a Service

S3

Hadoop Platform as a Service

S3, Cassandra, Teradata, Redshift, RDS

Data Platform as a Service

Franklin (Metadata API)

S3, Cassandra, Teradata, Redshift, RDS

Data Platform as a Service

Franklin (Metadata API)

Create a dataset for box art and color.

Whether your dataset is large or small, being able to visualize it makes it easier to explain.

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad hoc Visualization)

Sting

• Allows users to cache the results of a Genie job in memory

• Sub-second response to OLAP-style operations (slicing, dicing, aggregations)

• Ad hoc or recurring schedule

• Easy to use!
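The slides do not show how Sting works internally. As a rough illustration of what "slicing, dicing, aggregations" over a cached job result can look like, here is a minimal Python sketch. The row layout, the column names, and the helpers `slice_rows` and `aggregate` are all hypothetical and are not part of Sting's API.

```python
from collections import defaultdict

# Hypothetical cached job result, e.g. rows returned by a Hive query.
# Titles, columns, and values are made up for illustration only.
CACHED_ROWS = [
    {"title": "title_a", "hour": 20, "country": "US", "views": 120},
    {"title": "title_a", "hour": 21, "country": "US", "views": 80},
    {"title": "title_b", "hour": 20, "country": "CA", "views": 50},
    {"title": "title_b", "hour": 21, "country": "US", "views": 30},
]


def slice_rows(rows, **fixed):
    """Slice: keep only rows whose dimensions equal the given values."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


def aggregate(rows, group_by, measure):
    """Aggregate: sum a measure per combination of group_by dimensions."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[d] for d in group_by)
        totals[key] += r[measure]
    return dict(totals)


# Slice to one country, then aggregate views by title.
us_rows = slice_rows(CACHED_ROWS, country="US")
views_by_title = aggregate(us_rows, ["title"], "views")
print(views_by_title)
```

Because the rows already sit in memory, each operation is a single pass over a small list, which is the kind of workload where sub-second responses are plausible.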

Hive Query

Schema

% Content Consumed / Hour

Hemlock Grove

House of Cards

Arrested Development

Similarity

House of Cards, Macbeth

Toddlers & Tiaras

Star Trek: Voyager
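The slides do not say how similarity between titles is scored. One common choice, assumed here purely for illustration, is cosine similarity between per-title color histograms derived from box art. The titles, histogram values, and function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero histogram as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical color histograms (share of red, green, blue per box art).
HISTOGRAMS = {
    "title_a": [0.7, 0.2, 0.1],
    "title_b": [0.6, 0.3, 0.1],
    "title_c": [0.1, 0.1, 0.8],
}


def most_similar(title, histograms):
    """Return the other title whose histogram is closest to the given one."""
    target = histograms[title]
    scored = (
        (cosine_similarity(target, hist), name)
        for name, hist in histograms.items()
        if name != title
    )
    return max(scored)[1]


print(most_similar("title_a", HISTOGRAMS))
```

Cosine similarity ignores overall magnitude, so two box arts with the same color mix score as similar even if one image is larger or brighter in absolute pixel counts.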

Personalization

# of subscribers X # of titles = ???,000,…,000 (big data)

Big Data

Netflix Apache Pig

Lipstick

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad hoc Visualization)

Lipstick

• Allows users to visualize their data flow

• Allows users to see common errors

• Allows users to easily monitor their jobs

• Empowers users to support themselves

• Facilitates communication between infrastructure team and users
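To make "visualize their data flow" concrete, the sketch below renders a small operator graph as Graphviz DOT text, with row counts on the edges. It is not Lipstick's implementation: the plan, the operator names, the row counts, and the `to_dot` helper are hypothetical.

```python
# Hypothetical logical plan: operator alias -> list of downstream aliases.
PLAN = {
    "load_titles": ["join_art"],
    "load_box_art": ["join_art"],
    "join_art": ["group_by_color"],
    "group_by_color": ["store_result"],
    "store_result": [],
}

# Hypothetical intermediate row counts, keyed by (source, destination) edge.
ROW_COUNTS = {
    ("load_titles", "join_art"): 1000,
    ("load_box_art", "join_art"): 950,
    ("join_art", "group_by_color"): 950,
    ("group_by_color", "store_result"): 12,
}


def to_dot(plan, row_counts):
    """Render the plan as Graphviz DOT text, labeling edges with row counts."""
    lines = ["digraph plan {"]
    for op in plan:
        lines.append(f'  "{op}";')
    for src, targets in plan.items():
        for dst in targets:
            count = row_counts.get((src, dst))
            attr = f' [label="{count} rows"]' if count is not None else ""
            lines.append(f'  "{src}" -> "{dst}"{attr};')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(PLAN, ROW_COUNTS)
print(dot_text)
```

The printed text can be passed to any Graphviz renderer. Showing row counts on edges makes it easy to spot where a join or filter drops more records than expected.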

Lipstick

Overall Job Progress

Logical Plan

Overall Job Progress

Logical Operator (reduce side)

Logical Operator (map side)

Map/Reduce Job

Intermediate Row Count

Records Loaded

Hadoop Counters

My Job has stalled.

Common Problem #1

Unoptimized/Optimized Logical Plan Toggle

Dangling Operator
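The slide does not define "dangling operator". The sketch below assumes it means an operator whose output is never consumed by another operator and is never stored, so its work does not reach the results. The plan, the operator names, and `find_dangling` are hypothetical.

```python
# Hypothetical plan: operator alias -> list of downstream aliases.
# "filter_debug" is computed but nothing consumes or stores its output.
PLAN_WITH_DANGLING = {
    "load_events": ["filter_valid", "filter_debug"],
    "filter_valid": ["store_result"],
    "filter_debug": [],
    "store_result": [],
}


def find_dangling(plan, sinks):
    """Return operators with no downstream consumer that are not sinks.

    `sinks` is the set of operators that legitimately end a flow,
    such as store operators.
    """
    return sorted(
        op for op, targets in plan.items() if not targets and op not in sinks
    )


dangling = find_dangling(PLAN_WITH_DANGLING, sinks={"store_result"})
print(dangling)
```

Under this assumption, an operator that appears in the unoptimized plan but leads nowhere is a likely explanation for output that lacks expected data.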

I didn’t get the data I was expecting

Common Problem #2

I don’t understand why my job failed.

Common Problem #3

Failed Job (light red background)

Successful Job (light blue background)

Wrapping up

• Demos at the Netflix booth in the exhibit hall (see more Lipstick, Sting, and Genie).

• Lipstick is part of Netflix OSS.

• Clone it on GitHub at http://github.com/Netflix/Lipstick

• We welcome feedback and contributions!

Charles Smith: charsmith@netflix.com

Jeff Magnusson: jmagnusson@netflix.com

Thank you!

Jobs: http://jobs.netflix.com

Netflix OSS: http://netflix.github.io

Tech Blog: http://techblog.netflix.com/
