Deploying and Researching Hadoop in Virtual Machines

project--2 nd review_2


Page 1: project--2 nd review_2

Deploying and Researching Hadoop in Virtual Machines

Page 2: project--2 nd review_2

Hadoop:

• Hadoop is an open source software platform.
• It is derived from Google’s MapReduce and GFS (Google File System).
• Hadoop is an open source implementation of MapReduce.
• The Apache Hadoop project develops open source software for reliable and scalable distributed computing.

Definition:

• Basically, it is a way of storing enormous data sets across clusters of computers.
• It is designed to be robust and efficient.
• The Apache Hadoop software library is a framework for the distributed processing of large data sets.
• It is designed to scale up from single servers to thousands of machines.
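The idea of storing a data set across a cluster can be sketched as follows. This is a simplified illustration, not HDFS's actual placement policy (which is rack-aware): the function name `place_blocks`, the round-robin placement and the node names are assumptions made for the example. The 64 MB block size and replication factor of 3 are the defaults of first-generation Hadoop.

```python
def place_blocks(file_size, datanodes, block_size=64 * 1024 * 1024, replication=3):
    """Split a file into fixed-size blocks and pick replica nodes for each block.

    Hypothetical round-robin placement; real HDFS also considers racks and load.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size // block_size)
    placement = []
    for i in range(num_blocks):
        # Each block gets `replication` distinct nodes, shifted by one per block.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placement.append(replicas)
    return placement


# A 200 MB file on four (hypothetical) virtual machines.
placement = place_blocks(200 * 1024 * 1024, ["vm1", "vm2", "vm3", "vm4"])
for index, replicas in enumerate(placement):
    print(index, replicas)
```

Because every block lives on several nodes, losing one virtual machine does not lose data, which is what makes the storage layer robust.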

Page 3: project--2 nd review_2

Who uses Hadoop?

Page 4: project--2 nd review_2

Abstract:

• The emergence of Hadoop and the maturity of virtualization make it feasible to run Hadoop in virtual machines.

• This work introduces the technologies used, such as CloudStack, MapReduce and Hadoop.

• It describes how to deploy Hadoop in virtual machines, which can be obtained from CloudStack.

• We run some Hadoop programs on the virtual cluster.

Page 5: project--2 nd review_2

Introduction:

• Nowadays, the most frequently used programs are Internet-based services.

• At Google, MapReduce has been reported to process more than 20 PB of data per day.
• Ability to read and write data.
• A reliable shared storage and analysis system (HDFS and MapReduce).
• Enables applications to work.
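The MapReduce model mentioned above can be illustrated with a word count written in plain Python. This is only a sketch of the programming model on a single machine; it does not use the Hadoop API, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network; the logic per key stays the same.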

Page 6: project--2 nd review_2

Literature survey:

• Ignoring data locality in different types of environments can easily reduce MapReduce performance.

• Experimental results on two real data-intensive applications show that their data placement strategy.

• The first generation of Hadoop had two single points of failure: the NameNode and JobTracker processes.

• Hadoop MapReduce has two main services: the JobTracker and the TaskTracker.
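The data locality point can be made concrete with a small scheduling sketch. It assumes a greedy policy that prefers a node already holding a replica of the block and falls back to any node with a free slot. The function `assign_tasks` and the node and block names are hypothetical, and this is not the JobTracker's real scheduling algorithm.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that store the block.

    Returns {block: (node, is_local)}. Blocks that find no free slot are left out.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # no free slot anywhere in the cluster
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in replicas)
    return assignment


blocks = {"b1": ["vm1", "vm2"], "b2": ["vm1"], "b3": ["vm3"]}
assignment = assign_tasks(blocks, {"vm1": 1, "vm2": 1, "vm3": 0})
print(assignment)
```

Here `b2` ends up on `vm2`, which holds no replica of it, so its input must be read over the network. Non-local tasks like this are the source of the performance loss described above.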

Page 7: project--2 nd review_2

Existing System:

• Terabytes of data need to be processed efficiently on a daily basis.

• The existing system uses a single virtual machine.
• The disadvantage is the potential for poor performance under heavy load, which is the problem to be solved.

Page 8: project--2 nd review_2

Proposed System:

• The proposed system uses CloudStack infrastructure.
• MapReduce is designed to run on a cluster, and managing thousands of commodity PCs is a big job.
• The Hadoop applications are deployed on virtual machines.
• Perhaps the biggest problem is power consumption.

Page 9: project--2 nd review_2

Modules:

• Module 1: The user starts the NameNode, DataNode, JobTracker and TaskTracker nodes on the virtual machines.

• Module 2: The user observes the virtual machines running on the cluster infrastructure.

• Module 3: The user can connect to any virtual machine running on the cluster by providing the required details.

• Module 4: The user can deploy files on the connected virtual machine and do research on any virtual machine.
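Module 1 can be sketched as a helper that builds the daemon start commands for a virtual machine. The sketch assumes a first-generation Hadoop layout, where `bin/hadoop-daemon.sh start <daemon>` launches a single daemon and `bin/hadoop namenode -format` initializes HDFS. The install path `/usr/local/hadoop` and the master/worker role split are assumptions made for the example, and the commands are only assembled, not executed.

```python
# Which daemons run on which kind of VM (an assumed, typical split).
DAEMONS_BY_ROLE = {
    "master": ["namenode", "jobtracker"],
    "worker": ["datanode", "tasktracker"],
}


def startup_commands(role, hadoop_home="/usr/local/hadoop", format_namenode=False):
    """Return the command lines needed to start the Hadoop daemons for a role."""
    if role not in DAEMONS_BY_ROLE:
        raise ValueError(f"unknown role: {role}")
    commands = []
    if role == "master" and format_namenode:
        # Formatting erases HDFS metadata, so it is only done on first setup.
        commands.append([f"{hadoop_home}/bin/hadoop", "namenode", "-format"])
    for daemon in DAEMONS_BY_ROLE[role]:
        commands.append([f"{hadoop_home}/bin/hadoop-daemon.sh", "start", daemon])
    return commands


for command in startup_commands("master", format_namenode=True):
    print(" ".join(command))
```

Each returned list could be passed to `subprocess.run` on the target VM once the paths have been checked against the actual installation.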

Page 10: project--2 nd review_2

Hardware Requirements

• Pentium 4 Processor
• 8 GB RAM
• 64-bit OS (Ubuntu)
• 200 GB HDD

Page 11: project--2 nd review_2

Software Requirements

• Java 6
• Eclipse Indigo (with Hadoop configuration)
• Hadoop Appliance
• Cygwin
• CloudStack

Page 12: project--2 nd review_2

ARCHITECTURE

Page 13: project--2 nd review_2

3-Tier Architecture

Page 14: project--2 nd review_2

Master/Slave Architecture

Page 15: project--2 nd review_2

HDFS Architecture

Page 16: project--2 nd review_2

DESIGNING

Page 17: project--2 nd review_2

CLASS DIAGRAM

Start node
• Attributes: nameNodePort : number, dataNamePort : number, hdfsPort : number, command : string, nodeName : string
• Operations: start(), format()

Research
• Attributes: query : string
• Operations: submit(), cancel()

Deploy files
• Attributes: fileName : string, path : string, directory : string
• Operations: deploy(), cancel()

Connect to VM
• Attributes: portNo : number, hostName : string
• Operations: connect(), cancel()
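As an illustration of how one class from the diagram might look in code, here is a minimal sketch of "Connect to VM". The snake_case field names, the validation rules and the choice of a plain TCP connection are assumptions; the diagram specifies only the attributes `portNo` and `hostName` and the operations `connect()` and `cancel()`. The `cancel()` operation is omitted.

```python
import socket
from dataclasses import dataclass


@dataclass
class ConnectToVM:
    host_name: str  # hostName in the class diagram
    port_no: int    # portNo in the class diagram

    def __post_init__(self):
        # Reject obviously invalid details before any network access.
        if not self.host_name:
            raise ValueError("host_name must not be empty")
        if not 1 <= self.port_no <= 65535:
            raise ValueError(f"port_no out of range: {self.port_no}")

    def connect(self, timeout=3.0):
        """Return True if a TCP connection to the VM can be opened, else False."""
        try:
            with socket.create_connection((self.host_name, self.port_no), timeout=timeout):
                return True
        except OSError:
            return False
```

A caller would construct the object from the details the user enters, for example `ConnectToVM("vm1.example.internal", 22).connect()`, where the host name is a placeholder.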

Page 18: project--2 nd review_2

USE CASE DIAGRAM

Actor: user

Use cases: name node, data node, start job tracker, connect to VM, deploy files, research on files, logout

Page 19: project--2 nd review_2

SEQUENCE DIAGRAM

user HDFS

start name node

response

data node

response

job tracker

response

deploy files

response

research on files

response

logout

response

Page 20: project--2 nd review_2

COLLABORATION DIAGRAM

user HDFS

1: start name node

2: response

3: data node

4: response

5: job tracker

6: response

7: deploy files

8: response

9: research on files

10: response

11: logout

12: response

Page 21: project--2 nd review_2

TESTING

• Black Box Testing
• White Box Testing
• Grey Box Testing
• Regression Testing

Page 22: project--2 nd review_2

Test Cases

Name                       | Input                     | Output
Activate Root Account      | Username and password     | Successfully Enabled
Starting Management Server | Management Server Details | Successfully started
Adding Pod                 | Pod details               | Successfully Added
Adding Zone                | Zone Details              | Successfully Added
Adding Cluster             | Cluster Details           | Successfully Added
Primary Storage            | Primary Storage Details   | Successfully Added
Secondary Storage          | Secondary Storage Details | Successfully Added

Page 23: project--2 nd review_2

OUTPUT SCREENS

Page 24: project--2 nd review_2

Home Page

Page 25: project--2 nd review_2

Dashboard

Page 26: project--2 nd review_2

Instances

Page 27: project--2 nd review_2

Network

Page 28: project--2 nd review_2

Events

Page 29: project--2 nd review_2

Accounts

Page 30: project--2 nd review_2

Domains

Page 31: project--2 nd review_2

Infrastructure

Page 32: project--2 nd review_2

Projects

Page 33: project--2 nd review_2

Global Settings

Page 34: project--2 nd review_2

Service Settings

Page 35: project--2 nd review_2

Conclusion:

• This project introduces CloudStack, the MapReduce programming model and Hadoop, which allows distributed parallel processing, and shows that it is feasible to deploy and research Hadoop in virtual machines. The advantages are that it eases management, fully utilizes computing resources, makes Hadoop more reliable and saves power. Some methods to optimize Hadoop in virtual machines are then discussed.

Page 36: project--2 nd review_2

Future Enhancements

• Rights Management: For example, we can assign a test administrator to be responsible for an experimental course; the experimental teachers can then view and count only the information related to that course and have no permission for other courses.

• Experimental Control and Report Submission: The instructor can specify the actionable experimental project, and the system can design the experimental record, save the 1219 experimental project information that students have taken in the pilot project, and facilitate faculty management.


Page 38: project--2 nd review_2

Thank you for watching!