Originally presented at the 2014 Nashville Analytics Summit
The Nuts and Bolts of Hadoop and its Ever-changing Ecosystem, along with what I wish someone would have explained to me before I started working with Hadoop
Jeff Crawford, Associate Professor, Lipscomb University
https://www.linkedin.com/in/crawdoctor
Philip Best, Data Science Architect, HCA
Presented at the 2014 Analytics Summit (Franklin, TN), September 10, 2014
About Lipscomb's CCT…
Lipscomb's College of Computing and Technology offers the following graduate programs:
– MS in Information Technology
– MS in Informatics & Analytics
– MS in Software Engineering
Programs are designed with working professionals in mind. Earn an MS degree in as little as 12 months. The GRE is waived for those with 5 or more years of work experience in their area of study. See one of the Lipscomb folks for more information.
Visit http://technology.lipscomb.edu/ to learn more and apply.
A Spineless Disclaimer
All the thoughts in this presentation are:
– The result of learning a speaker dropped from the conference on Monday morning…
– Intended for good and not harm
– Derived from lots of reading, research, discussions and personal experience
– Review / old hat for some of you
– Not intended to be wholly inclusive
– Most likely correct
Presentation Agenda
1. What Hadoop is
2. What Hadoop isn't
3. What Hadoop looks like
4. Why didn't anyone tell me? Things I wish I would have known when getting started
5. How to get started and get experience
What Hadoop is
Definition somewhat depends on where you sit…
• Infrastructure – Concerned with cluster implementation, including but not limited to issues of performance, availability, scalability, etc.
• Data Science Proper – Concerned with extracting meaning from large and messy data sources
• Business Intelligence / Reporting – Concerned with delivering actionable information to the right people at the right time
• Management – Concerned with economically pursuing business objectives
What Hadoop is
• Designed to solve a specific type of problem
– How do I provide structure and meaning to a large and/or rapidly changing and/or unstructured set of data?
• Designed to address several limitations of traditional RDBMSs
What Hadoop is
• A distributed storage and processing framework that is abstracted from "users"
– HDFS
– MapReduce
• Open-source software (Apache Software Foundation) derived from organizations that (sometimes) like to share, such as
– Google
– Yahoo!
– Facebook
– Cloudera / Hortonworks
What Hadoop is
• Java-based
• Batch processing
– HDFS designed for "write once, read many" operations
• Flexible
– Can work with all types of data, constrained by your ability to program structure
– Can use a variety of languages beyond Java to interact with Hadoop: Python, Perl, C++, R, etc.
• Resilient
– Built with a "design to fail" mentality
– "Rack aware" storage
What Hadoop is
• Designed to utilize (mostly) commodity (COTS) hardware
• Linear-ish scalability
• Extensible
– Ever-moving, ever-changing, ever-evolving
What Hadoop is
[Hadoop architecture diagram; image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/]
What Hadoop is
[Cloudera CDH stack diagram; image pulled from http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html]
What Hadoop is
In summary, Hadoop provides a distributed storage and computation platform
– From existing hardware of varying quality
• Scaling out, not up
• Whatever hardware you can get your hands on
– Which handles data storage and resiliency
• Using HDFS to store files
• Built-in redundancy factor
– With a unified computation framework
• Making the traditionally hard task of parallel programming more attainable
• Automatically leveraging the locality of data
What Hadoop IS NOT
• A replacement for RDBMSs
• A solution for every type of problem
– Batch processing
– Expectation of "large" files
• Free
• Straightforward to administer / manage / work with
• A means of simplifying the definition of business objectives
• The path to operational zen
What Hadoop looks like
HDFS
• A file system
– Files are stored in blocks (typically 64 MB or 128 MB)
– Blocks are stored across multiple devices to address fault tolerance and performance issues
• Can be "rack aware"
• Utilizes two types of machines (aka nodes)
– Namenode: Contains information on the location of all files in the filesystem (metadata)
• Potential single point of failure, which is where HDFS HA comes in
• Can use a secondary namenode, but it is a simple backup using checkpoints
– Datanode: Contains the actual file data
• POSIX-like commands (see the examples below)
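The "POSIX-like commands" point is easiest to see by example. A few illustrative invocations of the hdfs dfs client, in the same spirit as the hadoop command shown later in the deck (the paths are hypothetical):

hdfs dfs -ls /user/jeff
hdfs dfs -mkdir /user/jeff/books
hdfs dfs -put moby_dick.txt /user/jeff/books
hdfs dfs -cat /user/jeff/books/moby_dick.txt | head
hdfs dfs -rm /user/jeff/books/moby_dick.txt

Each maps onto a familiar Unix command, which keeps the learning curve for basic file operations low.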
What Hadoop looks like
MapReduce
• A logical framework for distributed computation
• Its genius is that you perform compute processes as close as possible to the actual data
– Minimizes network costs
• Two versions currently: MRv1 and YARN (MRv2)
What Hadoop looks like
MapReduce
• At a high level, the process involves…
1. Prepare the Mapper environment: identify the initial key/value pairs to address within the dataset and distribute the Mapper to the appropriate nodes
2. Run Mapper code on the data to produce intermediate key/value pairs
3. Organize (i.e., "shuffle") the Mapper output and send it to the identified Reducers for further processing
4. Run Reducer code on the data to produce final key/value pairs
5. Collect all the Reducer output (sorted by final key)
• The canonical example when getting started with Hadoop is writing a MapReduce job that counts the words in a given corpus (e.g., a set of files); a sketch follows below
What Hadoop looks like
MapReduce
[MapReduce data flow figure from http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/]
WordCount Example – MapReduce via Python
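The original slides showed mapper.py and reducer.py as screenshots that did not survive extraction. Below is a minimal sketch of what a Hadoop Streaming WordCount pair conventionally looks like (representative code, not the exact code from the slides; the file names match the command that follows):

#!/usr/bin/env python
# mapper.py - read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts for the same word
# are adjacent and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

The streaming job below wires these two scripts together: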
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./mapper.py -reducer ./reducer.py \
  -input books/* -output WordCount/v1 \
  -file ./mapper.py -file ./reducer.py
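Two details of this invocation worth knowing: the -file flags ship the two scripts to every node that runs a task, and the output directory (WordCount/v1 here) must not already exist in HDFS, or the job will fail immediately.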
What Hadoop is
[Hadoop architecture diagram shown again as a transition to the ecosystem discussion; image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/]
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– Java? Seriously? Do I look like a sadist?
• Python, Perl, Ruby, PHP, etc. via Hadoop Streaming
– "Any language able to read from stdin, write to stdout and parse tab and newline characters will work"
• Apache Pig – scripting that simplifies the MapReduce process
• Apache Hive – SQL-ish code that allows you to generate MapReduce programs
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– All work and no play makes Jack a dull boy
• Apache Spark – provides in-memory processing capabilities
• Apache HBase – provides random, real-time read/write access to large datasets
– Hadoop's NoSQL column-oriented data store
• Cloudera's Impala – Hive-like, but provides a better data warehouse-ish experience
– What if I wanted to MapReduce my MapReduce job?
• Apache Oozie – provides a mechanism for scheduling MapReduce jobs
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– And the data will come from where?
• Apache Flume – allows pulling log-type files from external sources
• Apache Sqoop – allows back-and-forth transfer of data between Hadoop and most RDBMSs
– What about data mining?
• Mahout – provides a machine learning library for Hadoop
• R Connectors – allow you to utilize R as a front end for working with Hadoop
Revisiting WordCount with Pig
A = load 'books/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group as word, COUNT(B) as word_count;
E = order D by word_count desc;
store E into 'pig/wordcount_v1';
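Note how six lines of Pig Latin replace the mapper script, the reducer script and the streaming invocation shown earlier. Assuming the script is saved as wordcount.pig (a hypothetical file name), it can be run with pig wordcount.pig.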
Utilizing Hive
Consider a dataset that contains airline departure / arrival data for major US airports. We would like to generate some simple descriptive statistics for the large dataset. Simple in SQL, not so much in MapReduce…

select carrier, count(carrier) as carrier_count,
  sum(if(departuredelay IS NULL, 1, 0)) as dep_delay_null,
  max(departuredelay) as dep_delay_max,
  min(departuredelay) as dep_delay_min,
  avg(departuredelay) as dep_delay_avg,
  stddev(departuredelay) as dep_delay_stddev
from flightdata
group by carrier
order by carrier;
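The query assumes a flightdata table has already been defined over files sitting in HDFS. A minimal sketch of what such a definition might look like (the column list and comma-delimited layout are assumptions, not from the slides):

-- Hypothetical table definition mapping delimited files in HDFS to columns
create external table flightdata (
  carrier string,
  origin string,
  destination string,
  departuredelay int,
  arrivaldelay int
)
row format delimited fields terminated by ','
location '/user/jeff/flightdata';

Because the table is external, dropping it removes only the metadata and leaves the underlying files in HDFS. The statistics query can then be run interactively at the hive prompt or from the shell with hive -f stats.hql (again, a hypothetical file name).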
Why didn't anyone tell me?
Hadoop Things to Know
• The environment is fragile
– LOTS of moving, changing parts
• Everything is configurable
• The environment is not intuitive to most IT professionals
• All roads start and end with Java
• MapReduce jobs can get complex… keep them simple and chain them when necessary (see the sketch after this list)
• All nodes must have all tools available that are referenced in the MapReduce code
• Tasks run in their own Java Virtual Machine
– Can cause unnecessary overhead when there are many tasks
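For the chaining point above, the low-tech streaming approach is simply to feed one job's output directory to the next job's -input. A sketch with hypothetical script and directory names:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./extract.py -reducer ./aggregate.py \
  -input rawlogs/* -output pipeline/step1 \
  -file ./extract.py -file ./aggregate.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./rank.py -reducer ./topn.py \
  -input pipeline/step1 -output pipeline/step2 \
  -file ./rank.py -file ./topn.py

Tools like Apache Oozie (mentioned earlier) exist to manage exactly this kind of multi-step workflow once it outgrows a shell script.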
Getting started: Part 1
• Get Hadoop: The Definitive Guide
• Learn Java or Python basics
– For those with OO experience, Java might be best
– All others, give Python a shot
– Find a good IDE…
• Use Cloudera's QuickStart VM
– Requires VirtualBox or VMware and sufficient resources on your computer
• Find some peers to help you navigate through the questions that will arise
Getting started: Part 2
If you want to run Hadoop on your own hardware…
• Install Cloudera in pseudo-distributed mode
– Utilize virtualization if you don't have a dedicated machine
• Install CDH 5 in a cluster configuration
– At least 3 machines required
If you don't want to use your own hardware…
• Sign up for a free Amazon AWS account and follow the docs regarding Elastic MapReduce (EMR)
• http://aws.amazon.com/elasticmapreduce/
Data Science is a Team Sport
• Hacker
• Scientist
• Trusted advisor
• Quantitative analyst
• Business expert
• Technologist
• Project manager
Thoughts from Chapter 4 of Davenport (2014) & a bit of Crawford