Big Data
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too voluminous, moves too fast, or exceeds current processing capacity. Despite these problems, big data has the potential to help companies improve operations and make faster, more intelligent decisions.
Big Data: Volume or a Technology?
While the term may seem to reference the volume of data, that isn't always the case. The term big data, especially when used by vendors, may refer to the technology (the tools and processes) that an organization requires to handle large amounts of data and the associated storage facilities. The term big data is believed to have originated with Web search companies that needed to query very large, distributed aggregations of loosely structured data.
An Example of Big Data
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible.
Byte of data: one grain of rice
Kilobyte: a cup of rice
Megabyte: 8 bags of rice
Gigabyte: 3 container lorries
Terabyte: 2 container ships
Petabyte: covers Manhattan
Exabyte: covers the UK 3 times
Zettabyte: fills the Pacific Ocean
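To put numbers behind the analogy above: each unit is 1,024 times the previous one. A small illustrative Java sketch (using BigInteger, since a zettabyte exceeds the range of a 64-bit long) makes the progression concrete:

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: each binary unit is 1024 (2^10) times the previous one.
public class DataUnits {
    static final List<String> UNITS = Arrays.asList(
        "byte", "kilobyte", "megabyte", "gigabyte",
        "terabyte", "petabyte", "exabyte", "zettabyte");

    // Number of bytes in the named unit, e.g. bytesIn("petabyte") = 1024^5.
    public static BigInteger bytesIn(String unit) {
        return BigInteger.valueOf(1024).pow(UNITS.indexOf(unit));
    }

    public static void main(String[] args) {
        for (String u : UNITS) {
            System.out.println("1 " + u + " = " + bytesIn(u) + " bytes");
        }
    }
}
```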
For more insights :
http://www.webopedia.com/TERM/B/big_data_analytics.html
http://blog.yantrajaal.com/2015/04/hadoop-hortonworks-h2o-machine-learning.html
http://www.datamation.com/applications/big-data-analytics-overview.html
My first program in Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data and enormous processing power. The minimum requirement for running Hadoop here is a machine with at least 6 GB of RAM.
Step 1: Installing Oracle VirtualBox and Hadoop
Oracle VirtualBox is needed to host the virtual machine that runs Hadoop.
The Hadoop setup can be downloaded from the following link:
http://hortonworks.com/hdp/downloads/
Hortonworks is a company that focuses on the development and support of Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers.
For more info : http://hortonworks.com/hadoop/
After a successful installation of Hadoop, the Oracle VirtualBox screen will look as below:
Step 2: Starting Hadoop
Once the installation is complete, open the Hortonworks platform. A new window will open like the one shown below:
After Hadoop is installed, the Hortonworks Sandbox session can be accessed by typing the link into a new browser tab, as in the screenshot below:
Step 3: Logging into sandbox
Login: root
Password: Hadoop
Step 4: Creating the directory
A directory should be created in the sandbox over SSH (Secure Shell) with the following command:
mkdir WCclasses
Step 5: Running the given programs required to do the specified tasks
The Java source files needed for this task can be downloaded from the following link:
https://www.dropbox.com/s/s1pirjdqr8wf4jy/JavaWordCount.zip?dl=0
Run the following three programs:
SumReducer.java
WordCount.java
WordMapper.java
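Before running the three classes on the cluster, it helps to see what they compute together: WordMapper emits each word with a count of one, and SumReducer adds up the counts per word. A plain-Java sketch of that logic (without the Hadoop API, purely for illustration) looks like this:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of the MapReduce word-count logic:
// the "map" step splits text into words, the "reduce" step sums counts per word.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) { // map: emit one token per word
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);           // reduce: sum the counts per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print in the same "word <TAB> count" shape the reducer output uses.
        countWords("big data needs big tools").forEach(
            (word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In the real job these two steps run as separate, distributed phases; this sketch collapses them into one in-memory pass.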
Step 6: Uploading the Java programs via the shell
All the above files are created in the shell session using the following commands:
Each Java program is entered one by one by typing
vi <program name>.java
and pasting in the source, then saving the file. The saved programs reside under the current directory.
Step 7: Shell script for compiling and running the Java programs
After compilation, each Java program is saved under the WCclasses directory as a <program name>.class file, and the class files are then packaged into a jar.
From the screenshot we can see that all three programs have been compiled and saved under the file names SumReducer.class, WordCount.class, and WordMapper.class respectively. When the jar is built, the class files are listed as deflated, as in the screenshot above.
Step 8: Reflection of the Hadoop libraries in the HDP distribution
The programs compiled above can be executed in the HDP distribution with the following commands:
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc-inp
hdfs dfs -rm -r /user/hue/wc-out2
The jar file saved for word counting is then submitted to Hadoop with the following command (as in the screenshot above):
hadoop jar WordCount.jar WordCount /user/hue/wc-inp /user/hue/wc-out2
Step 9: Uploading the text files into the directory using Hue
Next, log into Hue with the URL, username, and password given for Hue, as in the screenshot:
Step 10: Modification of the shell scripts to point to the correct input and output directories
After logging into Hue, the shell scripts must be set to point to the correct input and output directories; this is done using the File Browser tab in Hue, as shown:
Step 11: Compilation and execution of the Java programs
wc-out2 is the output directory; the executed WordCount program saves its results there, as shown:
Step 12: WordCount program output
The word-count output from the three compiled Java programs can be seen under the Job Browser tab of Hue (logged in as root). There we can verify that the compiled job submitted from WordCount.jar has succeeded and that its output has been saved, as shown:
The word counts produced by the three Java programs compiled in the steps above are written to the Hadoop file path /user/hue/wc-out2/part-r-00000, as shown:
The screenshots show the word counts produced by the Java programs SumReducer.java, WordMapper.java, and WordCount.java.