Big Data
Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too voluminous, moves too fast, or exceeds current processing capacity. Despite these problems, big data has the potential to help companies improve operations and make faster, more intelligent decisions.
Big Data: Volume or a Technology?
While the term may seem to reference the volume of data, that isn't always the case. The term big data, especially when used by vendors, may refer to the technology (the tools and processes) that an organization requires to handle large amounts of data and the associated storage facilities. The term big data is believed to have originated with Web search companies that needed to query very large, distributed aggregations of loosely structured data.
An Example of Big Data
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer contact center, social media, mobile data and so on). The data is typically loosely structured data that is often incomplete and inaccessible.
Byte of data: one grain of rice
Kilobyte: a cup of rice
Megabyte: 8 bags of rice
Gigabyte: 3 container lorries
Terabyte: 2 container ships
Petabyte: covers Manhattan
Exabyte: covers the UK 3 times
Zettabyte: fills the Pacific Ocean
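To put numbers behind the analogy above: each unit is 1,024 times the previous one. A small illustrative Java sketch (using BigInteger, since a zettabyte exceeds the range of a 64-bit long) makes the progression concrete:

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: each binary unit is 1024 (2^10) times the previous one.
public class DataUnits {
    static final List<String> UNITS = Arrays.asList(
        "byte", "kilobyte", "megabyte", "gigabyte",
        "terabyte", "petabyte", "exabyte", "zettabyte");

    // Number of bytes in the named unit, e.g. bytesIn("petabyte") = 1024^5.
    public static BigInteger bytesIn(String unit) {
        return BigInteger.valueOf(1024).pow(UNITS.indexOf(unit));
    }

    public static void main(String[] args) {
        for (String u : UNITS) {
            System.out.println("1 " + u + " = " + bytesIn(u) + " bytes");
        }
    }
}
```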
For more insights :
http://www.webopedia.com/TERM/B/big_data_analytics.html
http://blog.yantrajaal.com/2015/04/hadoop-hortonworks-h2o-machine-learning.html
http://www.datamation.com/applications/big-data-analytics-overview.html
My first program in Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data and enormous processing power. The minimum requirement for running Hadoop here is a machine with at least 6 GB of RAM.
Step 1: Installing Oracle VirtualBox and Hadoop
Oracle VirtualBox is needed to host the virtual machine that runs Hadoop.
The Hadoop setup can be downloaded from the following link:
http://hortonworks.com/hdp/downloads/
Hortonworks is a company that focuses on the development and support of Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers.
For more info : http://hortonworks.com/hadoop/
After a successful installation of Hadoop, the Oracle VirtualBox screen will look as below:
Step 2: Starting Hadoop
Once the installation is complete, open the Hortonworks platform. A new window will open like the one shown below:
After Hadoop is installed, the Hortonworks Sandbox session can be accessed by typing the link into a new browser tab, as in the screenshot below:
Step 3: Logging into sandbox
Login: root
Password: Hadoop
Step 4: Creating the directory
A directory should be created in the sandbox over SSH (Secure Shell) with the following command:
mkdir WCclasses
Step 5: Running the given programs required to do the specified tasks
The Java source files needed for this task can be downloaded from the following link:
https://www.dropbox.com/s/s1pirjdqr8wf4jy/JavaWordCount.zip?dl=0
Run the following three programs:
SumReducer.java
WordCount.java
WordMapper.java
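Before running the three classes on the cluster, it helps to see what they compute together: WordMapper emits each word with a count of one, and SumReducer adds up the counts per word. A plain-Java sketch of that logic (without the Hadoop API, purely for illustration) looks like this:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of the MapReduce word-count logic:
// the "map" step splits text into words, the "reduce" step sums counts per word.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) { // map: emit one token per word
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);           // reduce: sum the counts per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print in the same "word <TAB> count" shape the reducer output uses.
        countWords("big data needs big tools").forEach(
            (word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In the real job these two steps run as separate, distributed phases; this sketch collapses them into one in-memory pass.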
Step 6: Uploading the Java programs via the shell
All the above files are created in the shell session using the following commands:
Each Java program is entered one by one by typing
vi <program name>.java
and pasting in the source, then saving the file. The saved programs reside under the current directory.
Step 7: Shell script for compiling and running the Java programs
After compilation, each Java program is saved under the WCclasses directory as a <program name>.class file, and the class files are then packaged into a jar.
From the screenshot we can see that all three programs have been compiled and saved under the file names SumReducer.class, WordCount.class, and WordMapper.class respectively. When the jar is built, the class files are listed as deflated, as in the screenshot above.
Step 8: Reflection of the Hadoop libraries in the HDP distribution
The programs compiled above can be executed in the HDP distribution with the following commands:
hdfs dfs -ls /user/hue
hdfs dfs -ls /user/hue/wc-inp
hdfs dfs -rm -r /user/hue/wc-out2
The jar file saved for word counting is then submitted to Hadoop with the following command (as in the screenshot above):
hadoop jar WordCount.jar WordCount /user/hue/wc-inp /user/hue/wc-out2
Step 9: Uploading the text files into the directory using Hue
Next, log into Hue with the URL, username, and password given for Hue, as in the screenshot:
Step 10: Modification of the shell scripts to point to the correct input and output directories
After logging into Hue, the shell scripts must be set to point to the correct input and output directories; this is done using the File Browser tab in Hue, as shown:
Step 11: Compilation and execution of the Java programs
wc-out2 is the output directory; the executed WordCount program saves its results there, as shown:
Step 12: WordCount program output
The word-count output from the three compiled Java programs can be seen under the Job Browser tab of Hue (logged in as root). There we can verify that the compiled job submitted from WordCount.jar has succeeded and that its output has been saved, as shown:
The word counts produced by the three Java programs compiled in the steps above are written to the Hadoop file path /user/hue/wc-out2/part-r-00000, as shown:
The screenshots show the word counts produced by the Java programs SumReducer.java, WordMapper.java, and WordCount.java.