Originally presented at the 2014 Nashville Analytics Summit
The Nuts and Bolts of Hadoop and its Ever-changing Ecosystem, along with what I wish someone would have explained to me before I started working with Hadoop
Jeff Crawford, Associate Professor, Lipscomb University
https://www.linkedin.com/in/crawdoctor
Philip Best, Data Science Architect, HCA
Presented at the 2014 Analytics Summit (Franklin, TN), September 10, 2014
About Lipscomb's CCT…
Lipscomb's College of Computing and Technology offers the following graduate programs:
– MS in Information Technology
– MS in Informatics & Analytics
– MS in Software Engineering
Programs are designed with working professionals in mind. Earn an MS degree in as little as 12 months. The GRE is waived for those with 5 or more years of work experience in their area of study. See one of the Lipscomb folks for more information.
Visit http://technology.lipscomb.edu/ to learn more and apply.
A Spineless Disclaimer
All the thoughts in this presentation are:
– The result of learning a speaker dropped from the conference on Monday morning…
– Intended for good and not harm
– Derived from lots of reading, research, discussions and personal experience
– Review / old hat for some of you
– Not intended to be wholly inclusive
– Most likely correct
Presentation Agenda
1. What Hadoop is
2. What Hadoop isn't
3. What Hadoop looks like
4. Why didn't anyone tell me? Things I wish I would have known when getting started
5. How to get started and get experience
What Hadoop is
Definition somewhat depends on where you sit…
• Infrastructure – Concerned with cluster implementation, including but not limited to issues of performance, availability, scalability, etc.
• Data Science Proper – Concerned with extracting meaning from large and messy data sources
• Business Intelligence / Reporting – Concerned with delivering actionable information to the right people at the right time
• Management – Concerned with economically pursuing business objectives
What Hadoop is
• Designed to solve a specific type of problem
– How do I provide structure and meaning to a large and/or rapidly changing and/or unstructured set of data?
• Designed to address several limitations of traditional RDBMSs
What Hadoop is
• A distributed storage and processing framework that is abstracted from "users"
– HDFS
– MapReduce
• Open-source software (Apache Software Foundation) derived from organizations that (sometimes) like to share, such as
– Google
– Yahoo!
– Facebook
– Cloudera / Hortonworks
What Hadoop is
• Java-based
• Batch processing
– HDFS designed for "write once, read many" operations
• Flexible
– Can work with all types of data, constrained by your ability to program structure
– Can use a variety of languages beyond Java to interact with Hadoop: Python, Perl, C++, R, etc.
• Resilient
– Built with a "design to fail" mentality
– "Rack aware" storage
What Hadoop is
• Designed to utilize (mostly) commodity (COTS) hardware
• Linear-ish scalability
• Extensible
– Ever-moving, ever-changing, ever-evolving
What Hadoop is
[Hadoop architecture diagram; image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/]
What Hadoop is
[Cloudera CDH stack diagram; image pulled from http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html]
What Hadoop is
In summary, Hadoop provides a distributed storage and computation platform
– From existing hardware of varying quality
• Scaling out, not up
• Whatever hardware you can get your hands on
– Which handles data storage and resiliency
• Using HDFS to store files
• Built-in redundancy factor
– With a unified computation framework
• Making the traditionally hard task of parallel programming more attainable
• Automatically leveraging the locality of data
What Hadoop IS NOT
• A replacement for RDBMSs
• A solution for every type of problem
– Batch processing
– Expectation of "large" files
• Free
• Straightforward to administer / manage / work with
• A means of simplifying the definition of business objectives
• The path to operational zen
What Hadoop looks like
HDFS
• A file system
– Files are stored in blocks (typically 64 MB or 128 MB)
– Blocks are stored across multiple devices to address fault tolerance and performance issues
• Can be "rack aware"
• Utilizes two types of machines (aka nodes)
– Namenode: Contains information on the location of all files in the filesystem (metadata)
• Potential single point of failure, which is where HDFS HA comes in
• Can use a secondary namenode, but it is a simple backup using checkpoints
– Datanode: Contains the actual file data
• POSIX-like commands (see the examples below)
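The "POSIX-like commands" point is easiest to see by example. A few illustrative invocations of the hdfs dfs client, in the same spirit as the hadoop command shown later in the deck (the paths are hypothetical):

hdfs dfs -ls /user/jeff
hdfs dfs -mkdir /user/jeff/books
hdfs dfs -put moby_dick.txt /user/jeff/books
hdfs dfs -cat /user/jeff/books/moby_dick.txt | head
hdfs dfs -rm /user/jeff/books/moby_dick.txt

Each maps onto a familiar Unix command, which keeps the learning curve for basic file operations low.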
What Hadoop looks like
MapReduce
• A logical framework for distributed computation
• Its genius is that you perform compute processes as close as possible to the actual data
– Minimizes network costs
• Two versions currently: MRv1 and YARN (MRv2)
What Hadoop looks like
MapReduce
• At a high level, the process involves…
1. Prepare the Mapper environment: identify the initial key/value pairs to address within the dataset and distribute the Mapper to the appropriate nodes
2. Run Mapper code on the data to produce intermediate key/value pairs
3. Organize (i.e., "shuffle") the Mapper output and send it to the identified Reducers for further processing
4. Run Reducer code on the data to produce final key/value pairs
5. Collect all the Reducer output (sorted by final key)
• The canonical example when getting started with Hadoop is writing a MapReduce job that counts the words in a given corpus (e.g., a set of files); a sketch follows below
What Hadoop looks like
MapReduce
[MapReduce data flow figure from http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/]
WordCount Example – MapReduce via Python
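The original slides showed mapper.py and reducer.py as screenshots that did not survive extraction. Below is a minimal sketch of what a Hadoop Streaming WordCount pair conventionally looks like (representative code, not the exact code from the slides; the file names match the command that follows):

#!/usr/bin/env python
# mapper.py - read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts for the same word
# are adjacent and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

The streaming job below wires these two scripts together: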
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./mapper.py -reducer ./reducer.py \
  -input books/* -output WordCount/v1 \
  -file ./mapper.py -file ./reducer.py
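Two details of this invocation worth knowing: the -file flags ship the two scripts to every node that runs a task, and the output directory (WordCount/v1 here) must not already exist in HDFS, or the job will fail immediately.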
What Hadoop is
[Hadoop architecture diagram shown again as a transition to the ecosystem discussion; image pulled from http://techblog.baghel.com/index.php?itemid=132, details at http://hadoop.apache.org/]
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– Java? Seriously? Do I look like a sadist?
• Python, Perl, Ruby, PHP, etc. via Hadoop Streaming
– "Any language able to read from stdin, write to stdout and parse tab and newline characters will work"
• Apache Pig – scripting that simplifies the MapReduce process
• Apache Hive – SQL-ish code that allows you to generate MapReduce programs
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– All work and no play makes Jack a dull boy
• Apache Spark – provides in-memory processing capabilities
• Apache HBase – provides random, real-time read/write access to large datasets
– Hadoop's NoSQL column-oriented data store
• Cloudera's Impala – Hive-like, but provides a better data warehouse-ish experience
– What if I wanted to MapReduce my MapReduce job?
• Apache Oozie – provides a mechanism for scheduling MapReduce jobs
What Hadoop looks like
Hadoop Ecosystem
• Things often heard when people are introduced to Hadoop…
– And the data will come from where?
• Apache Flume – allows pulling log-type files from external sources
• Apache Sqoop – allows back-and-forth transfer of data between Hadoop and most RDBMSs
– What about data mining?
• Mahout – provides a machine learning library for Hadoop
• R Connectors – allow you to utilize R as a front end for working with Hadoop
Revisiting WordCount with Pig
A = load 'books/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group as word, COUNT(B) as word_count;
E = order D by word_count desc;
store E into 'pig/wordcount_v1';
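Note how six lines of Pig Latin replace the mapper script, the reducer script and the streaming invocation shown earlier. Assuming the script is saved as wordcount.pig (a hypothetical file name), it can be run with pig wordcount.pig.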
Utilizing Hive
Consider a dataset that contains airline departure / arrival data for major US airports. We would like to generate some simple descriptive statistics for the large dataset. Simple in SQL, not so much in MapReduce…

select carrier, count(carrier) as carrier_count,
  sum(if(departuredelay IS NULL, 1, 0)) as dep_delay_null,
  max(departuredelay) as dep_delay_max,
  min(departuredelay) as dep_delay_min,
  avg(departuredelay) as dep_delay_avg,
  stddev(departuredelay) as dep_delay_stddev
from flightdata
group by carrier
order by carrier;
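The query assumes a flightdata table has already been defined over files sitting in HDFS. A minimal sketch of what such a definition might look like (the column list and comma-delimited layout are assumptions, not from the slides):

-- Hypothetical table definition mapping delimited files in HDFS to columns
create external table flightdata (
  carrier string,
  origin string,
  destination string,
  departuredelay int,
  arrivaldelay int
)
row format delimited fields terminated by ','
location '/user/jeff/flightdata';

Because the table is external, dropping it removes only the metadata and leaves the underlying files in HDFS. The statistics query can then be run interactively at the hive prompt or from the shell with hive -f stats.hql (again, a hypothetical file name).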
Why didn't anyone tell me?
Hadoop Things to Know
• The environment is fragile
– LOTS of moving, changing parts
• Everything is configurable
• The environment is not intuitive to most IT professionals
• All roads start and end with Java
• MapReduce jobs can get complex… keep them simple and chain them when necessary (see the sketch after this list)
• All nodes must have all tools available that are referenced in the MapReduce code
• Tasks run in their own Java Virtual Machine
– Can cause unnecessary overhead when there are many tasks
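For the chaining point above, the low-tech streaming approach is simply to feed one job's output directory to the next job's -input. A sketch with hypothetical script and directory names:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./extract.py -reducer ./aggregate.py \
  -input rawlogs/* -output pipeline/step1 \
  -file ./extract.py -file ./aggregate.py

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -mapper ./rank.py -reducer ./topn.py \
  -input pipeline/step1 -output pipeline/step2 \
  -file ./rank.py -file ./topn.py

Tools like Apache Oozie (mentioned earlier) exist to manage exactly this kind of multi-step workflow once it outgrows a shell script.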
Getting started: Part 1
• Get Hadoop: The Definitive Guide
• Learn Java or Python basics
– For those with OO experience, Java might be best
– All others, give Python a shot
– Find a good IDE…
• Use Cloudera's QuickStart VM
– Requires VirtualBox or VMware and sufficient resources on your computer
• Find some peers to help you navigate through the questions that will arise
Getting started: Part 2
If you want to run Hadoop on your own hardware…
• Install Cloudera in pseudo-distributed mode
– Utilize virtualization if you don't have a dedicated machine
• Install CDH 5 in a cluster configuration
– At least 3 machines required
If you don't want to use your own hardware…
• Sign up for a free Amazon AWS account and follow the docs regarding Elastic MapReduce (EMR)
• http://aws.amazon.com/elasticmapreduce/
Data Science is a Team Sport
• Hacker
• Scientist
• Trusted advisor
• Quantitative analyst
• Business expert
• Technologist
• Project manager
Thoughts from Chapter 4 of Davenport (2014) & a bit of Crawford