GROUP 7 TOOLS FOR BIG DATA Sandeep Prasad Dipojjwal Ray

GROUP 7

TOOLS FOR BIG DATA

Sandeep Prasad, Dipojjwal Ray

Objectives...

Apache Hadoop
Successful installation of Apache Hadoop v1.0.3 and v1.0.4
WordCount with Hadoop MapReduce
Estimating the value of 'Pi' with Hadoop MapReduce
MapReduce and HDFS

Apache Hadoop...

High-Availability Distributed Object-Oriented Platform
Open source
Can run as a pseudo-distributed single-node cluster
Began as part of the Apache Lucene project; now a top-level Apache project
Handles petabytes of data

Installation of Hadoop v1.0.3 & 1.0.4...

Release Date v1.0.3 : May 16, 2012
Release Date v1.0.4 : October 12, 2012
OS : Ubuntu v12.04
Prerequisites : Sun Java, hduser configuration

Examples...

WordCount example :

$ /bin/hadoop jar hadoop-1.0.3-examples.jar wordcount file01.txt (out)
out = output directory (wordcount takes an input path and an output path)
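What the WordCount job computes can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps, not the Java code shipped in the examples jar; the function names and the sample lines are made up for the sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Sample input invented for this sketch.
    print(word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]))
```

On a real cluster the mappers run on different nodes and the shuffle moves data over the network; here everything happens in one process.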

Estimation of 'Pi'

$ /bin/hadoop jar hadoop-1.0.3-examples.jar pi (x) (y)
x = number of maps
y = samples per map
Runtime : 2.25 seconds (x=10 ; y=100)
Estimated value : 3.1480000000000
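The idea behind the 'Pi' job can be sketched as follows: each map task checks how many of its sample points in a unit square fall inside the inscribed circle, and the reduce step adds up the counts. This is a simplified, single-process sketch and not the code from the examples jar. The Halton-sequence point generator and all function names are assumptions chosen for the illustration, so the result need not match the value reported above.

```python
def halton(index, base):
    """Return the index-th element (1-based) of the Halton sequence for a base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def map_task(map_id, samples_per_map):
    """Count how many of this task's sample points fall inside the circle."""
    inside = 0
    start = map_id * samples_per_map
    for i in range(start + 1, start + samples_per_map + 1):
        # Point in the unit square, shifted so the circle is centred at (0, 0).
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside, samples_per_map


def estimate_pi(num_maps, samples_per_map):
    """Reduce step: sum the per-map counts and scale the ratio to estimate pi."""
    results = [map_task(m, samples_per_map) for m in range(num_maps)]
    inside = sum(r[0] for r in results)
    total = sum(r[1] for r in results)
    # Circle area (pi / 4) divided by square area (1) is about inside / total.
    return 4.0 * inside / total


if __name__ == "__main__":
    print(estimate_pi(10, 100))  # x = 10 maps, y = 100 samples per map
```

Raising x or y increases the total number of sample points, which generally tightens the estimate at the cost of more work.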

MapReduce & HDFS...

Divide and conquer algorithm
Map() and Reduce() functions have their roots in functional programming
JobTracker and TaskTracker
NameNode and DataNode
Hadoop Distributed File System
Java framework
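The functional-programming roots and the divide-and-conquer idea can be illustrated with Python's built-in map and functools.reduce. This is a toy sketch under assumed names (sum_of_squares, sum_of_squares_in_chunks); it uses no Hadoop API.

```python
from functools import reduce


def sum_of_squares(numbers):
    """Apply a function to every element (map), then fold the results (reduce)."""
    squared = map(lambda n: n * n, numbers)
    return reduce(lambda acc, n: acc + n, squared, 0)


def sum_of_squares_in_chunks(numbers, chunk_size):
    """Divide the input into chunks, solve each independently, combine the partials."""
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    partials = [sum_of_squares(chunk) for chunk in chunks]
    return reduce(lambda acc, n: acc + n, partials, 0)


if __name__ == "__main__":
    print(sum_of_squares([1, 2, 3, 4]))               # 30
    print(sum_of_squares_in_chunks([1, 2, 3, 4], 3))  # 30
```

Because the chunks do not depend on each other, each one could be handed to a different worker, which is the property MapReduce exploits.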

References...

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster

http://lintool.github.io/Cloud9/

Data-Intensive Text Processing with MapReduce, book by Jimmy Lin and Chris Dyer

http://hadoop.apache.org/releases.html

http://www.apache.org/dyn/closer.cgi/hadoop/co

THANK YOU

Framework written in Java
Highly fault-tolerant distributed file system

The JobTracker web UI provides general job statistics for the Hadoop cluster, the running, completed, and failed jobs, and a job history log file.

The TaskTracker web UI shows running and non-running tasks.
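The fault tolerance of the distributed file system comes from splitting files into blocks and keeping several replicas of each block on different DataNodes, with the NameNode tracking where they are. The sketch below models that bookkeeping. The 64 MB block size and replication factor of 3 are the Hadoop 1.x defaults; the round-robin placement, the node names, and the function names are simplifying assumptions for illustration and do not reflect the real placement policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # Hadoop 1.x default block size (64 MB)
REPLICATION = 3                # Hadoop 1.x default replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]}, the kind of map a NameNode keeps.

    Placement here is plain round-robin, an assumption made for this sketch.
    """
    num_blocks = math.ceil(file_size / block_size)
    copies = min(replication, len(datanodes))
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(copies)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has a replica on a live node."""
    return all(
        any(node != failed_node for node in nodes) for nodes in placement.values()
    )


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    layout = place_blocks(200 * 1024 * 1024, nodes)
    print(layout)
    print(readable_after_failure(layout, "dn1"))
```

With a single DataNode, as in a pseudo-distributed setup, only one replica of each block can exist, so losing that node loses the data.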