Playing with Hadoop (NPW2013)


Nordic Perl Workshop 2013

Playing with Hadoop
Søren Lund (slu)
[email protected]

DISCLAIMER

I have no experience with Hadoop in a real-world project

The installation notes I present are not necessarily suitable for production

The example scripts have not been used on real (big) data

Hence the title Playing with Hadoop

About Hadoop (and Big Data)

The Problem (it's not new)

We have (access to) more and more data

Processing this data takes longer and longer

Not enough memory

Running out of disk space

Our trusty old server can't keep up

Scaling up

Upgrade hardware: bigger and faster

Redundancy: power supply, RAID, hot-swap

Expensive to keep scaling up

Our software will run without modifications

Scaling out

Add more (commodity) servers

Redundancy is replaced by replication

You can keep on scaling out, it's cheap

How do we enable our software to run across multiple servers?

Google solved this

Google published two papers:

Google File System (GFS), 2003
http://research.google.com/archive/gfs.html

MapReduce, 2004

http://research.google.com/archive/mapreduce.html

GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way

Hadoop was born

Doug Cutting read the Google papers

Based on those, he created Hadoop
(named after his son's toy elephant)

It is an implementation of GFS/MapReduce
(Open Source / Apache License)

Written in Java and deployed on Linux

Started as part of the Lucene project (via Nutch), now a top-level Apache project

https://hadoop.apache.org/

Hadoop Components

Hadoop Common: utilities that support the other modules

HDFS: Hadoop Distributed File System

YARN: Yet Another Resource Negotiator

MapReduce: YARN-based parallel processing

This enables us to write software that can handle Big Data by scaling out

Big Data isn't just big

Huge amounts of data (volume)

Unstructured data (form)

Highly dynamic data (burst/change rate)

In other words, Big Data is data that is hard to handle with traditional tools and methods

Examples of Big Data

Log files, e.g. web server access logs

application logs

Internet feeds: Twitter, Facebook, etc.

RSS

Images (face recognition, tagging)

Installing Hadoop

Needed to run Hadoop

You need the following to run Hadoop:

Linux server

Java JDK

Hadoop tarball

I'm using the following:

Ubuntu 12.04 LTS 64 bit

JDK 1.6.24 64 bit

Hadoop 1.0.4

Could not get JDK7 + Hadoop 2.2 to work

Install Java

Set up Java home and path

Add hadoop user

Create SSH key for hadoop user

Accept SSH key

Install Hadoop and add to path

Disable IPv6

Reboot and check the installation (e.g. run hadoop version)

Running an example job

Calculate Pi

Estimated value of Pi

Three modes of operation

Pi was calculated in local (standalone) mode:

it is the default mode (i.e. no configuration needed)

all components of Hadoop run in a single JVM

Pseudo-distributed mode:

a separate JVM is spawned for each component

components communicate using sockets

it is a mini-cluster on a single host

Fully distributed mode:

components are spread across multiple machines

Create base directory for HDFS

Set JAVA_HOME (in conf/hadoop-env.sh)

Edit core-site.xml

Edit hdfs-site.xml

Edit mapred-site.xml
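For reference, the standard Hadoop 1.x pseudo-distributed settings (the exact values used in the talk may differ) are: in core-site.xml, point fs.default.name at hdfs://localhost:9000 and hadoop.tmp.dir at the base directory created above; in hdfs-site.xml, set dfs.replication to 1 (a single node cannot hold more replicas); in mapred-site.xml, set mapred.job.tracker to localhost:9001.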

Log out and log on as hadoop

Format HDFS (e.g. hadoop namenode -format)

Start HDFS (start-dfs.sh)

Start MapReduce (start-mapred.sh)

Create home directory & test data

Running Word Count

First let's try the example jar

Inspect the result

Compile and run our own jar

https://gist.github.com/soren/7213273

Inspect result

Run improved version

https://gist.github.com/soren/7213453

Inspect (improved) result

Hadoop MapReduce

A reducer will get all values associated with a given key

A precursor job can be used to normalize data

Combiners can be used to perform early aggregation of map output before it is sent to the reducer (a mini-reduce run on each mapper's own output)
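For example, if the mappers emit (the, 1), (cat, 1), (the, 1), the shuffle phase groups the pairs by key, so the reducer is called with cat → [1] and the → [1, 1]. A word-count combiner would collapse a single mapper's (the, 1), (the, 1) into (the, 2) before anything crosses the network.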

Perl MapReduce

Playing with MapReduce

We don't need Hadoop to play with MapReduce

Instead we can emulate Hadoop using two scripts

wc_mapper.pl: a Word Count mapper

wc_reducer.pl: a Word Count reducer

We connect them using a pipe (|)

Very Unix-like!

Run MapReduce without Hadoop

https://gist.github.com/soren/7596270

https://gist.github.com/soren/7596285
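The two gists above are the actual scripts from the talk; as a rough sketch of the idea (mine, not the author's code), the pair could look like this:

    #!/usr/bin/perl
    # wc_mapper.pl - emit "word<TAB>1" for every word read from STDIN
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        print "$_\t1\n" for grep { length } split /\W+/, lc $line;
    }

    #!/usr/bin/perl
    # wc_reducer.pl - sum the counts per word; the hash stands in for
    # Hadoop's shuffle/sort, so a plain pipe works without sorting
    use strict;
    use warnings;

    my %count;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($word, $n) = split /\t/, $line;
        $count{$word} += $n;
    }
    print "$_\t$count{$_}\n" for sort keys %count;

Run it as: perl wc_mapper.pl < input.txt | perl wc_reducer.pl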

Hadoop's Streaming interface

Enables you to write jobs in any programming language, e.g. Perl

Input from STDIN

Output to STDOUT

Key/Value pairs separated by TAB

Reducers will get values one-by-one

Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop
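Back to the one-by-one point above: under real Streaming, a reducer just reads key/value lines sorted by key, so it must detect key boundaries itself. A minimal sketch (my illustration, not one of the gists):

    #!/usr/bin/perl
    # Streaming-style reducer: Hadoop delivers reducer input sorted by key,
    # one "key<TAB>value" line at a time, so we track key changes ourselves
    use strict;
    use warnings;

    my ($current, $count) = (undef, 0);
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        if (defined $current && $key ne $current) {
            print "$current\t$count\n";   # key changed: previous key is done
            $count = 0;
        }
        $current = $key;
        $count += $value;
    }
    print "$current\t$count\n" if defined $current;   # flush the final key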

Run Perl Word Count

https://gist.github.com/soren/7596270

https://gist.github.com/soren/7596285

Inspect result

Hadoop::Streaming

Perl interface to Hadoop's Streaming interface

Implemented in Moose

You can now implement your MapReduce as:

a class with a map() and a reduce() method

a mapper script

a reducer script
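Sketched from the module's CPAN synopsis (the gists a few slides ahead are the author's real scripts; check the docs for the exact iterator API), a word count could look like this:

    #!/usr/bin/perl
    # WcMapper.pl - word-count mapper as a Hadoop::Streaming class
    package WcMapper;
    use Moose;
    with 'Hadoop::Streaming::Mapper';

    sub map {
        my ($self, $line) = @_;
        $self->emit($_ => 1) for grep { length } split /\W+/, lc $line;
    }

    package main;
    WcMapper->run;

    #!/usr/bin/perl
    # WcReducer.pl - word-count reducer; values arrive through an iterator
    package WcReducer;
    use Moose;
    with 'Hadoop::Streaming::Reducer';

    sub reduce {
        my ($self, $key, $values) = @_;
        my $count = 0;
        $count += $values->next while $values->has_next;
        $self->emit($key => $count);
    }

    package main;
    WcReducer->run;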

Installing Hadoop::Streaming

Btw, Perl was already installed on the server ;-)

But we want to install Hadoop::Streaming

I also had to install local::lib to make it work

All you have to do is:

sudo cpan local::lib Hadoop::Streaming

Nice and easy

Run Hadoop::Streaming job

https://gist.github.com/soren/7596451

https://gist.github.com/soren/7600134

https://gist.github.com/soren/7600144

Inspect result

Some final notes and loose ends

The Web User Interface

HDFS: http://localhost:8070/

MapReduce: http://localhost:8030/

File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfoPort=8070&dir=/

Note: this is with port forwarding in VirtualBox: 50030 → 8030, 50070 → 8070, 50075 → 8075 (Hadoop's defaults are 50070 for the HDFS NameNode, 50030 for the MapReduce JobTracker and 50075 for the DataNode file browser)

Joins in Hadoop

It's possible to implement joins in MapReduce:

Reduce-joins: simple (sketched below)

Map-joins: less data to transfer

Do you need joins? Maybe your data has structure and SQL is a better fit?

Try Hive (HiveQL)

Or Pig (Pig Latin)
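If you do stay in plain MapReduce, the reduce-join mentioned above works by tagging each record with its source in the mapper; the sort between the phases then brings both sides of each key together in the reducer. A sketch with hypothetical inputs users.txt (id,name) and orders.txt (id,amount); note that Streaming exposes job settings as environment variables, so map.input.file becomes $ENV{map_input_file}:

    #!/usr/bin/perl
    # join_mapper.pl - tag each CSV record with its source file ('U' or 'O')
    use strict;
    use warnings;

    my $tag = ($ENV{map_input_file} || '') =~ /users/ ? 'U' : 'O';
    while (my $line = <STDIN>) {
        chomp $line;
        my ($id, $rest) = split /,/, $line, 2;
        print "$id\t$tag:$rest\n";
    }

    #!/usr/bin/perl
    # join_reducer.pl - input arrives sorted by id; buffer both sides of
    # the current id, emit joined rows on key change (assumes one user per id)
    use strict;
    use warnings;

    my ($current, $name, @orders);

    sub flush {
        if (defined $current && defined $name) {
            print "$current\t$name\t$_\n" for @orders;   # inner join
        }
        $name   = undef;
        @orders = ();
    }

    while (my $line = <STDIN>) {
        chomp $line;
        my ($id, $tagged) = split /\t/, $line, 2;
        flush() if defined $current && $id ne $current;
        $current = $id;
        my ($tag, $rest) = split /:/, $tagged, 2;
        if ($tag eq 'U') { $name = $rest } else { push @orders, $rest }
    }
    flush();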

Hadoop in the Cloud

Elastic MapReduce (EMR)
http://aws.amazon.com/elasticmapreduce/

Essentially Hadoop in the Cloud

Built on EC2 and S3

You can upload JARs or scripts

There's more

Distributions:

Cloudera Distribution for Hadoop (CDH)
http://www.cloudera.com/

Hortonworks Data Platform (HDP)
http://hortonworks.com/

HBase, Hive, Pig and other related projects
https://hadoop.apache.org/

But a basic Hadoop setup is a good start, and a nice place to just play with Hadoop

I like big data and I can not lie

Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys' setups
Who understands those Hadoop guys
They only map/reduce it because it is on a distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah

The End

Questions?

Slides will be available at http://www.slideshare.net/slu/

Find me on Twitter: https://twitter.com/slu
