7/31/2019 Development of Distributed Cluster & Problem Testing
Presented by: Syed Shabi-ul-hasnain Nazir
Supervised by: Sir Tahir Roshmi
Distributed Computing:
Distributed computing is a field of computer science that studies distributed
systems.
A distributed system consists of multiple computers that communicate through a
computer network.
The computers interact with each other in order to achieve a common goal.
In distributed computing, a problem is divided into many tasks, each of which is
solved by one or more computers.
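The division of a problem into tasks can be sketched in a few lines of Python. This is a conceptual illustration only: the workers here are threads in one process, standing in for the separate computers of a real distributed system, and the names (`subtask`, `distributed_sum`) are made up for the example.

```python
# Conceptual sketch: a problem (summing a large list) is divided into
# tasks, each solved by a separate worker, and the partial results are
# combined toward the common goal. Threads stand in for machines.
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # One worker solves one piece of the problem.
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    # Divide the problem into roughly equal tasks.
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(subtask, chunks)
    # Combine the partial results to achieve the common goal.
    return sum(partials)

print(distributed_sum(list(range(1000))))  # 499500
```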
Paradigm means a pattern, example, or model.
There are several distributed computing paradigms:
The Message Passing Paradigm.
The Client-Server Paradigm.
The Peer-to-Peer System Architecture.
The Message System Paradigm.
Remote Procedure Call.
The Mobile Agent Paradigm.
Groupware Paradigm.
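Of these, the message-passing paradigm is the simplest to illustrate. The sketch below is a minimal, hypothetical example: two parties (threads standing in for machines) interact only by exchanging messages over queues, never through shared state.

```python
# Minimal sketch of the message-passing paradigm: a "server" and a
# "client" communicate solely by sending messages over queues.
import threading
import queue

def server(inbox, outbox):
    # The server loops, receiving messages and replying to each one.
    while True:
        msg = inbox.get()
        if msg == "quit":
            break
        outbox.put(f"echo: {msg}")

to_server = queue.Queue()
to_client = queue.Queue()
t = threading.Thread(target=server, args=(to_server, to_client))
t.start()

to_server.put("hello")       # client sends a message
reply = to_client.get()      # client waits for the reply
to_server.put("quit")        # client tells the server to stop
t.join()
print(reply)  # echo: hello
```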
Hadoop is a software framework that enables distributed manipulation of large amounts of data. Hadoop does this in a way that makes it reliable and efficient.
Hadoop is reliable because it assumes that computing elements and storage will fail
and, therefore, it maintains several copies of working data to ensure that processing
can be redistributed around failed nodes.
Hadoop is efficient because it works on the principle of parallelization, allowing
data to be processed in parallel to increase processing speed.
Hadoop is made up of a number of elements.
Hadoop Distributed File System (HDFS)
MapReduce
The HDFS architecture is built from a collection of special nodes.
There are two types of node in HDFS:
1. Name Node
2. Data Node
The Name Node (there is only one) provides metadata services within HDFS; the Data Nodes serve storage blocks for HDFS.
Files are stored in HDFS as blocks, and the size of one block is 64 MB (the default in this Hadoop version).
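The block arithmetic is straightforward; a small sketch (64 MB was the HDFS default in Hadoop 0.20.x, and it is configurable) shows how many blocks a file occupies:

```python
# How many 64 MB HDFS blocks does a file need? The last block may be
# only partly full, so we round up.
import math

BLOCK_SIZE_MB = 64  # HDFS default block size in this Hadoop version

def num_blocks(file_size_mb):
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(1024))  # a 1 GB file occupies 16 blocks
print(num_blocks(100))   # a 100 MB file occupies 2 blocks
```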
MapReduce is itself a software framework for the parallel processing of large datasets across a distributed cluster of processors or stand-alone computers.
It consists of two operations.
The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain.
The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
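The two operations can be sketched in plain Python (this is a conceptual illustration of the model, not the Hadoop API), using the classic word-count example:

```python
# Word count in the MapReduce style: Map emits key/value pairs,
# a sort groups pairs by key, and Reduce collapses each group to
# a single key/value pair.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: one (word, 1) pair per element of the input.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: a single key/value pair results for each key.
    return (key, sum(values))

lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = [kv for line in lines for kv in map_fn(line)]
pairs.sort(key=itemgetter(0))  # the shuffle/sort phase groups by key
counts = dict(reduce_fn(k, [v for _, v in grp])
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the same roles are played by user-supplied Mapper and Reducer classes, with the framework handling the shuffle/sort between them.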
Single PC
PC name: Dell GX620 Pentium 4
No. of PCs: 1
RAM: 2 GB
Hard disk: 30 GB
Processor: 3.4 GHz
Operating system: Ubuntu 11.04
Software: eclipse-SDK-3.6.1
Cluster PCs
Name Node
PC name: Core 2 Duo
No. of PCs: 1
RAM: 2 GB
Hard disk: 200 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X
Data Nodes
PC name: Dell Pentium 4
No. of PCs: 8
RAM: 1 GB
Hard disk: 30 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X
Sorting using a single PC
We sort data of different sizes on a single PC and note the time consumed by the sorting process.
Sorting 1 GB of data:
Sorting 1 GB of data on a single PC takes 22 min and 15 sec.
Sorting 10 GB of data:
Sorting 10 GB of data on a single PC takes 4 hours, 6 min and 23 sec.
Sorting using the 8-PC cluster
We sort the same data on the 8-PC cluster and note the time consumed by the sorting process.
Sorting 1 GB of data:
Sorting 1 GB of data with TeraSort on the multi-node cluster takes 3 min and 33 sec.
Sorting 10 GB of data:
Sorting 10 GB of data with TeraSort on the multi-node cluster takes 9 min and 54 sec.
Setup                  Data size   Sorting time
Sorting on single PC   1 GB        22 min 15 sec
Sorting on single PC   10 GB       4 h 6 min 23 sec
Sorting on cluster     1 GB        3 min 33 sec
Sorting on cluster     10 GB       9 min 54 sec
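The speedup implied by these measurements can be worked out directly (a small sketch; the `seconds` helper is made up for the calculation):

```python
# Speedup of the 8-node cluster over the single PC, from the measured
# sorting times reported above.
def seconds(h=0, m=0, s=0):
    return h * 3600 + m * 60 + s

single_1gb = seconds(m=22, s=15)         # 1335 s
cluster_1gb = seconds(m=3, s=33)         # 213 s
single_10gb = seconds(h=4, m=6, s=23)    # 14783 s
cluster_10gb = seconds(m=9, s=54)        # 594 s

print(round(single_1gb / cluster_1gb, 1))    # 6.3  (1 GB: ~6x faster)
print(round(single_10gb / cluster_10gb, 1))  # 24.9 (10 GB: ~25x faster)
```

Note that the relative speedup grows with the data size: the larger the job, the better the cluster amortizes its coordination overhead.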
[Figure: comparison graph for sorting 1 GB of data, single PC vs 8-node cluster]
[Figure: comparison graph for sorting 10 GB of data, single PC vs cluster]
Sorting data on the cluster takes much less time than sorting it on a single PC, and the cluster's relative advantage grows as the data size increases.
THANKS