Development of Distributed Cluster & Problem Testing


  • 7/31/2019 Development of Distributed Cluster & Problem Testing



Presented by: Syed Shabi-ul-hasnain Nazir

Supervised by: Sir Tahir Roshmi


Distributed Computing:

Distributed computing is a field of computer science that studies distributed systems.

A distributed system consists of multiple computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal.

In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers.
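The idea of dividing one problem into tasks solved by several workers can be sketched with Python's standard multiprocessing pool. This is a minimal single-machine illustration of the principle, not a real distributed framework; the function names and chunking scheme are purely illustrative.

```python
from multiprocessing import Pool

def sum_chunk(chunk):
    """One task: each worker sums only its own slice of the data."""
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    """Split the problem into roughly n_workers tasks, solve them in
    parallel, then combine the partial results into the final answer."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_workers) as pool:
        partials = pool.map(sum_chunk, chunks)
    return sum(partials)

if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))  # 5050
```

Each chunk is an independent task, so adding workers shortens the wall-clock time without changing the result.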


A paradigm is a pattern, example, or model. Distributed systems are built around several common paradigms:

    The Message Passing Paradigm.

    The Client-Server Paradigm.

    The Peer-to-Peer System Architecture.

    The Message System Paradigm.

    Remote Procedure Call.

    The Mobile Agent Paradigm.

    Groupware Paradigm.
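Two of these paradigms, message passing and client-server, can be illustrated with a toy Python sketch that uses threads and queues in place of real machines and a network. All names here are illustrative; a real system would use sockets or RPC.

```python
import threading
from queue import Queue

def server(inbox: Queue, outbox: Queue) -> None:
    """Server: waits for a request message, sends back a reply message."""
    request = inbox.get()
    outbox.put(f"echo: {request}")

def client(to_server: Queue, from_server: Queue, message: str) -> str:
    """Client: sends a request and blocks until the reply arrives."""
    to_server.put(message)
    return from_server.get()

# The two queues stand in for the network channel between the machines.
to_server, to_client = Queue(), Queue()
t = threading.Thread(target=server, args=(to_server, to_client))
t.start()
reply = client(to_server, to_client, "hello")
t.join()
print(reply)  # echo: hello
```

The key property shared by all message-passing systems is visible even here: the two parties interact only through messages, never through shared state.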


Hadoop is a software framework that enables distributed manipulation of large amounts of data. Hadoop does this in a way that makes it reliable and efficient.

Hadoop is reliable because it assumes that computing elements and storage will fail and, therefore, it maintains several copies of working data to ensure that processing can be redistributed around failed nodes.

Hadoop is efficient because it works on the principle of parallelization, processing data in parallel to increase speed.
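The fault-tolerance idea — keep several copies of each piece of data so work can continue around a failed node — can be sketched as below. This is a toy round-robin placement; real HDFS placement is rack-aware and considerably more involved, and the node names and replication factor are illustrative.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i, block in enumerate(blocks)}

def survivors(placement, failed):
    """After a node fails, each block is still served by its remaining replicas."""
    return {b: [n for n in ns if n != failed] for b, ns in placement.items()}

p = place_replicas(["b0", "b1", "b2"], ["node1", "node2", "node3", "node4"])
s = survivors(p, "node1")
# With 3 replicas, every block survives a single-node failure with copies to spare.
assert all(len(ns) >= 2 for ns in s.values())
```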


Hadoop is made up of two main elements:

Hadoop Distributed File System (HDFS)

MapReduce


The HDFS architecture is built from a collection of special nodes. There are two types of node in HDFS:

1. Name Node

2. Data Node

The Name Node (there is only one) provides metadata services within HDFS, and the Data Nodes serve storage blocks for HDFS.

Files are stored in HDFS in the form of blocks, and the default size of one block is 64 MB.
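With a 64 MB block size, the number of blocks a file occupies follows directly; the quick calculation below is illustrative arithmetic, not an HDFS API call.

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size stated above

def blocks_needed(file_size_mb):
    """A file is split into fixed-size blocks; the last block may be partial."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(blocks_needed(1024))   # a 1 GB file  -> 16 blocks
print(blocks_needed(10240))  # a 10 GB file -> 160 blocks
```

Each of those blocks is then replicated across Data Nodes, which is why the Name Node's metadata (block locations) is central to every read.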


MapReduce is itself a software framework for the parallel processing of large datasets across a distributed cluster of processors or stand-alone computers.

It consists of two operations.

The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain.


The Reduce function takes the list that resulted from the Map function and reduces the list of key/value pairs based on their key (a single key/value pair results for each key).
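The two operations can be illustrated with the classic word-count example. This is a minimal in-memory sketch of the Map → shuffle → Reduce flow, not Hadoop's actual Java API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    """Map: emit one (key, value) pair per element of the input."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: collapse all pairs sharing a key into a single pair."""
    return (key, sum(values))

def map_reduce(lines):
    # Map phase: apply map_fn to every input line.
    pairs = [pair for line in lines for pair in map_fn(line)]
    # Shuffle phase: group pairs by key (Hadoop does this between Map and Reduce).
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one output pair per distinct key.
    return dict(reduce_fn(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["big data big cluster", "big data"]))
# {'big': 3, 'cluster': 1, 'data': 2}
```

Because every Map call and every Reduce call is independent, the framework can scatter them freely across the cluster's nodes.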


Single PC

PC name: Dell GX620, Pentium 4
No. of PCs: 1
RAM: 2 GB
Hard disk: 30 GB
Processor: 3.4 GHz
Operating system: Ubuntu 11.04
Software: eclipse-SDK-3.6.1


Cluster PCs

Name Node

PC name: Core 2 Duo
No. of PCs: 1
RAM: 2 GB
Hard disk: 200 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X

Data Node

PC name: Dell Pentium 4
No. of PCs: 8
RAM: 1 GB
Hard disk: 30 GB
Processor: 3.0 GHz
Operating system: Ubuntu 11.04
Hadoop version: 0.20.0X


Sorting on a single PC

We sorted different data sizes on a single PC and noted the time consumed by the sorting process.

Sorting 1 GB of data: sorting 1 GB of data on a single PC takes 22 min 15 s.

Sorting 10 GB of data: sorting 10 GB of data on a single PC takes 4 h 6 min 23 s.


Sorting on the 8-PC cluster

We sorted the same data sizes on the 8-PC cluster and noted the time consumed by the sorting process.

Sorting 1 GB of data: sorting 1 GB of data with TeraSort on the multi-node cluster takes 3 min 33 s.

Sorting 10 GB of data: sorting 10 GB of data with TeraSort on the multi-node cluster takes 9 min 54 s.


                       Data size   Sorting time
Sorting on single PC   1 GB        22 min 15 s
                       10 GB       4 h 6 min 23 s
Sorting on cluster     1 GB        3 min 33 s
                       10 GB       9 min 54 s


[Figure: Comparison graph for sorting 1 GB of data — single PC vs. 8-node cluster]


[Figure: Comparison graph for sorting 10 GB of data — single PC vs. cluster]


Sorting data on the cluster takes far less time than sorting on a single PC, and as the data size grows the cluster's advantage over the single system increases.


    THANKS