Presented by: Katie Woods and Jordan Howell
Hadoop
What is Hadoop?
*Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of the Google File System and of MapReduce.
*It provides a distributed filesystem (HDFS) that can store data across thousands of servers, and a means of running work (Map/Reduce jobs) across those machines, running the work near the data.
*It runs on Java 1.6 or higher, with support for both Linux and Windows.
*The software is resilient because it detects and handles failures at the application layer rather than relying on the hardware.
What does it do?
*Hadoop provides a way to solve problems that involve very large data sets.
*It can be used to interact with and analyze data that doesn't fit neatly in a database (financial portfolios, targeted product advertising).
Brief History
*Hadoop was created by Doug Cutting, then a Yahoo! engineer, and Michael J. Cafarella. The name Hadoop came from Cutting's son's stuffed toy elephant, which is also where Hadoop got its logo.
*Hadoop was originally built as an infrastructure for the Nutch project, which crawls the web and builds a search engine index for the crawled pages.
MapReduce
*MapReduce expresses large distributed computation as a sequence of distributed operations on data sets of key/value pairs.
*A Map/Reduce computation has two phases, a map phase and a reduce phase.
Map Phase
*The framework splits the input data set into a large number of fragments and assigns each fragment to a map task.
*Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs.
*Following the map phase the framework sorts the intermediate data set by key and produces a set of tuples so that all the values associated with a particular key appear together.
*The set of tuples is partitioned into a number of fragments equal to the number of reduce tasks that will be performed.
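The map phase described above can be sketched in plain Python. This is not the Hadoop Java API, only an illustration of the model; the word-splitting mapper, the two input fragments, and the byte-sum partition function are all illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative user-defined map function: emits (word, 1) pairs.
def map_task(key, value):
    for word in value.split():
        yield (word, 1)

# The input data set, split into fragments; each goes to one map task.
fragments = [(0, "the quick brown fox"), (1, "the lazy dog")]

# Run the map tasks and collect the intermediate key/value pairs.
intermediate = [pair for k, v in fragments for pair in map_task(k, v)]

# The framework sorts by key so all values for a key appear together.
intermediate.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g])
           for k, g in groupby(intermediate, key=itemgetter(0))]

# Partition the grouped tuples among the reduce tasks (here, 2),
# using a deterministic hash of the key bytes.
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for key, values in grouped:
    partitions[sum(key.encode()) % num_reducers].append((key, values))

print(dict(grouped)["the"])  # [1, 1] -- both occurrences grouped together
```

Note how "the" appears in both fragments but ends up in a single tuple after the sort, which is exactly the property the reduce phase relies on.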
Reduce Phase
*Each reduce task consumes the fragment of tuples assigned to it and transforms each tuple into an output key/value pair as described by the user-defined reduce function.
*Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each reduce task.
*Tasks in each phase are executed in a fault-tolerant manner: if nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes, and any data that can be recovered is also redistributed.
*Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
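The reduce step above can be sketched in plain Python (again, not the Hadoop Java API; the summing reduce function and the sample fragment of grouped tuples are illustrative assumptions):

```python
# Illustrative user-defined reduce function: sums the counts for one key.
def reduce_task(key, values):
    return (key, sum(values))

# A fragment of intermediate tuples shipped to one reduce task
# (each key already has all of its values grouped together).
fragment = [("dog", [1]), ("the", [1, 1, 1])]

# The reduce task turns each tuple into an output key/value pair.
output = [reduce_task(key, values) for key, values in fragment]
print(output)  # [('dog', 1), ('the', 3)]
```

Because each key's values arrive pre-grouped, every reduce task can run independently, which is what makes redistribution after a node failure cheap.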
Overview - Architecture
*Hadoop has a master/slave architecture.
*There is a single master server, or jobtracker, and several slave servers, or tasktrackers, one per node in the cluster.
*The jobtracker is the point of interaction between users and the framework.
*Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis.
*The jobtracker manages the assignment of map and reduce tasks to the tasktrackers.
*The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
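The first-come/first-served scheduling described above can be sketched as a simple queue. This is an illustrative simulation, not Hadoop's actual scheduler; the job names, node names, and round-robin task assignment are assumptions made for the example.

```python
from collections import deque

# Pending jobs queue up at the jobtracker in submission order.
job_queue = deque(["job-1", "job-2", "job-3"])

# The jobtracker hands jobs out first-come/first-served,
# assigning each to a tasktracker (round-robin here, for simplicity).
tasktrackers = ["node-a", "node-b"]
assignments = []
while job_queue:
    job = job_queue.popleft()  # earliest-submitted job first
    node = tasktrackers[len(assignments) % len(tasktrackers)]
    assignments.append((job, node))

print(assignments)
# [('job-1', 'node-a'), ('job-2', 'node-b'), ('job-3', 'node-a')]
```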
Hardware
*Hadoop is designed to run on a large number of machines that don't share any memory or disks (i.e., commodity PCs and servers).
*Scaling is best with nodes that have dual processors/cores with 4-8GB of RAM.
*The best cost/performance comes from machines at 1/2 to 1/3 the cost of application servers but above the cost of standard desktop machines.
*Bandwidth requirements depend on the jobs being run and the number of nodes; most jobs produce around 100 MB/s of data.
Hadoop Distributed File System (HDFS)
*Designed to reliably store very large files across machines in a large cluster.
*Inspired by the Google File System.
*Each file is stored as a sequence of blocks; all blocks in a file except the last are the same size.
*Blocks belonging to a file are replicated for fault tolerance.
*The block size and replication factor are configurable per file.
*Files in HDFS are "write once" and have strictly one writer at any time.
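The block model above can be sketched in a few lines. The block size and file size here are illustrative only (real HDFS block-size defaults vary by version and are configurable per file, as noted above):

```python
# Split a file's bytes into fixed-size blocks; only the last block
# may be smaller than the block size.
def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"x" * 250                       # a 250-byte "file"
blocks = split_into_blocks(file_bytes, block_size=100)
print([len(b) for b in blocks])  # [100, 100, 50]
```

Each of these blocks would then be replicated across several Datanodes according to the file's replication factor.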
HDFS Architecture
*HDFS also follows a master/slave architecture.
*An installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients.
*There are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on.
*The Namenode executes filesystem namespace operations such as opening, closing, and renaming files and directories, and determines the mapping of blocks to Datanodes.
*The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
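The Namenode's block-to-Datanode mapping can be sketched as simple round-robin replica placement. This is purely illustrative; HDFS's real placement policy is rack-aware, and the node names and replication factor here are assumptions for the example.

```python
# Illustrative replica placement: give each block's replicas to
# distinct Datanodes, shifting round-robin per block.
def place_blocks(num_blocks, datanodes, replication=3):
    mapping = {}
    for b in range(num_blocks):
        mapping[b] = [datanodes[(b + r) % len(datanodes)]
                      for r in range(replication)]
    return mapping

datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(num_blocks=2, datanodes=datanodes, replication=3)
print(placement)  # {0: ['dn1', 'dn2', 'dn3'], 1: ['dn2', 'dn3', 'dn4']}
```

With three replicas per block, any single Datanode can fail and every block remains readable from the surviving copies.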
Developing for Hadoop
*A Hadoop adopter must be more sophisticated than a relational database adopter.
*There are not that many “turn-key” applications that are ready to use.
*Each company that uses Hadoop will likely need to develop its own programs in house.
*Hadoop is written in Java but APIs are available for multiple languages and many vendors are now providing tools to assist.
What Hadoop is not…
*A silver bullet that will solve all application/datacenter problems.
*A replacement for a database or SAN.
*A replacement for an existing NFS.
*Always the fastest solution. If operations depend on the output of preceding operations, Hadoop/MapReduce will likely provide no benefit.
*A place to learn Java.
*A place to learn networking.
*A place to learn Unix/Linux system administration.
Companies that use Hadoop
*IBM
*Adobe
*Amazon
*Yahoo!
*Rackspace
*Etsy
Just to name a few
Questions/Comments