7/31/2019 Development of Distributed Cluster & Problem Testing
Presented by: Syed Shabi-ul-hasnain Nazir
Supervised by: Sir Tahir Roshmi
Distributed Computing:
Distributed computing is a field of computer science that studies distributed
systems.
A distributed system consists of multiple computers that communicate through a
computer network.
The computers interact with each other in order to achieve a common goal.
In distributed computing, a problem is divided into many tasks, each of which is
solved by one or more computers.
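The division of a problem into tasks can be sketched in a few lines of Python. This is a conceptual illustration only: the workers here are threads in one process, standing in for the separate computers of a real distributed system, and the names (`subtask`, `distributed_sum`) are made up for the example.

```python
# Conceptual sketch: a problem (summing a large list) is divided into
# tasks, each solved by a separate worker, and the partial results are
# combined toward the common goal. Threads stand in for machines.
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # One worker solves one piece of the problem.
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    # Divide the problem into roughly equal tasks.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(subtask, chunks)
    # Combine the partial results to achieve the common goal.
    return sum(partials)

print(distributed_sum(list(range(1000))))  # 499500
```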
Paradigm means a pattern, example, or model.
There are several distributed computing paradigms:
The Message Passing Paradigm.
The Client-Server Paradigm.
The Peer-to-Peer System Architecture.
The Message System Paradigm.
Remote Procedure Call.
The Mobile Agent Paradigm.
Groupware Paradigm.
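Of these, the message-passing paradigm is the simplest to illustrate. The sketch below is a minimal, hypothetical example: two parties (threads standing in for machines) interact only by exchanging messages over queues, never through shared state.

```python
# Minimal sketch of the message-passing paradigm: a "server" and a
# "client" communicate solely by sending messages over queues.
import threading
import queue

def server(inbox, outbox):
    # The server loops, receiving messages and replying to each one.
    while True:
        msg = inbox.get()
        if msg == "quit":
            break
        outbox.put(f"echo: {msg}")

to_server = queue.Queue()
to_client = queue.Queue()
t = threading.Thread(target=server, args=(to_server, to_client))
t.start()

to_server.put("hello")       # client sends a message
reply = to_client.get()      # client waits for the reply
to_server.put("quit")        # client tells the server to stop
t.join()
print(reply)  # echo: hello
```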
Hadoop is a software framework that enables distributed manipulation of large amounts of data. Hadoop does this in a way that makes it reliable and efficient.
Hadoop is reliable because it assumes that computing elements and storage will fail
and, therefore, it maintains several copies of working data to ensure that processing
can be redistributed around failed nodes.
Hadoop is efficient because it works on the principle of parallelization, allowing
data to be processed in parallel to increase processing speed.
Hadoop is made up of a number of elements.
Hadoop Distributed File System (HDFS)
MapReduce
The HDFS architecture is built from a collection of special nodes.
There are two types of node in HDFS:
1. Name Node
2. Data Node
The Name Node (there is only one) provides metadata services within HDFS; the Data Nodes serve storage blocks for HDFS.
Files are stored in HDFS as blocks, and the size of one block is 64 MB (the default in this Hadoop version).
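The block arithmetic is straightforward; a small sketch (64 MB was the HDFS default in Hadoop 0.20.x, and it is configurable) shows how many blocks a file occupies:

```python
# How many 64 MB HDFS blocks does a file need? The last block may be
# only partly full, so we round up.
import math

BLOCK_SIZE_MB = 64  # HDFS default block size in this Hadoop version

def num_blocks(file_size_mb):
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(1024))  # a 1 GB file occupies 16 blocks
print(num_blocks(100))   # a 100 MB file occupies 2 blocks
```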
MapReduce is itself a software framework for the parallel processing of large datasets across a distributed cluster of processors or stand-alone computers.
It consists of two operations.
The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain.
The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
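The two operations can be sketched in plain Python (this is a conceptual illustration of the model, not the Hadoop API), using the classic word-count example:

```python
# Word count in the MapReduce style: Map emits key/value pairs,
# a sort groups pairs by key, and Reduce collapses each group to
# a single key/value pair.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: one (word, 1) pair per element of the input.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: a single key/value pair results for each key.
    return (key, sum(values))

lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_fn(line)]
pairs.sort(key=itemgetter(0))  # the shuffle/sort phase groups by key
counts = dict(reduce_fn(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the same roles are played by user-supplied Mapper and Reducer classes, with the framework handling the shuffle/sort between them.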
Single PC
PC name: Dell GX620 Pentium 4
No. of PCs: 1
RAM: 2 GB
Hard disk: 30 GB
Processor: 3.4 GHz
Operating system: Ubuntu 11.04
Software: eclipse-SDK-3.6.1
Cluster PCs
Name Node
PC name: Core 2 Duo
No. of PCs: 1
RAM: 2 GB
Hard disk: 200 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X
Data Nodes
PC name: Dell Pentium 4
No. of PCs: 8
RAM: 1 GB
Hard disk: 30 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X
Sorting using a single PC
We sort data of different sizes on a single PC and note the time consumed by the sorting process.
Sorting 1 GB of data:
Sorting 1 GB of data on a single PC takes 22 min and 15 sec.
Sorting 10 GB of data:
Sorting 10 GB of data on a single PC takes 4 hours, 6 min and 23 sec.
Sorting using the 8-PC cluster
We sort the same data on the 8-PC cluster and note the time consumed by the sorting process.
Sorting 1 GB of data:
Sorting 1 GB of data with TeraSort on the multi-node cluster takes 3 min and 33 sec.
Sorting 10 GB of data:
Sorting 10 GB of data with TeraSort on the multi-node cluster takes 9 min and 54 sec.
Setup                  Data size   Sorting time
Sorting on single PC   1 GB        22 min 15 sec
Sorting on single PC   10 GB       4 h 6 min 23 sec
Sorting on cluster     1 GB        3 min 33 sec
Sorting on cluster     10 GB       9 min 54 sec
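The speedup implied by these measurements can be worked out directly (a small sketch; the `seconds` helper is made up for the calculation):

```python
# Speedup of the 8-node cluster over the single PC, from the measured
# sorting times reported above.
def seconds(h=0, m=0, s=0):
    return h * 3600 + m * 60 + s

single_1gb = seconds(m=22, s=15)         # 1335 s
cluster_1gb = seconds(m=3, s=33)         # 213 s
single_10gb = seconds(h=4, m=6, s=23)    # 14783 s
cluster_10gb = seconds(m=9, s=54)        # 594 s

print(round(single_1gb / cluster_1gb, 1))    # 6.3  (1 GB: ~6x faster)
print(round(single_10gb / cluster_10gb, 1))  # 24.9 (10 GB: ~25x faster)
```

Note that the relative speedup grows with the data size: the larger the job, the better the cluster amortizes its coordination overhead.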
[Figure: comparison graph for sorting 1 GB of data, single PC vs 8-node cluster]
[Figure: comparison graph for sorting 10 GB of data, single PC vs cluster]
Sorting data on the cluster takes much less time than sorting it on a single PC, and the cluster's relative advantage grows as the data size increases.
THANKS