
Difference Between Hadoop 2 and Hadoop 3


Feature comparison: Hadoop 2.x vs Hadoop 3.x

License
Hadoop 2.x: Apache 2.0, open source.
Hadoop 3.x: Apache 2.0, open source.

Minimum supported Java version
Hadoop 2.x: Java 7.
Hadoop 3.x: Java 8.

Fault tolerance
Hadoop 2.x: Handled by replication, which costs extra storage space.
Hadoop 3.x: Can be handled by erasure coding, which needs less extra space; replication remains available.
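To illustrate how erasure coding can tolerate a lost block without keeping full copies, here is a minimal sketch using a single XOR parity block. This is a deliberate simplification: it survives only one lost block, whereas layouts with several parity blocks rely on a stronger code such as Reed-Solomon. The function names are hypothetical and are not part of any Hadoop API.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """Return one parity block: the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data block from the survivors and the parity."""
    # XOR-ing the parity with every surviving block cancels them out,
    # leaving only the bytes of the missing block.
    return xor_parity(surviving + [parity])


data = [b"hadoop", b"blocks", b"stripe"]
parity = xor_parity(data)

# Simulate losing the second block, then rebuild it from the rest.
rebuilt = recover_missing([data[0], data[2]], parity)
print(rebuilt == data[1])
```

Only one extra block is stored here for three data blocks, which is why parity-based schemes need less space than full copies.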

Data balancing
Hadoop 2.x: Uses the HDFS balancer.
Hadoop 3.x: Adds an intra-DataNode balancer, invoked via the hdfs diskbalancer CLI, to even out data across the disks of a single DataNode.
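The idea behind balancing inside a single DataNode can be illustrated with a small sketch. Assuming each disk reports its used and total bytes, one way to quantify imbalance is to compare every disk's fill ratio with the node-wide ideal ratio. The function name, the sample paths, and the numbers are hypothetical; this illustrates the concept and is not the planner that the hdfs diskbalancer CLI actually runs.

```python
def volume_densities(volumes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each volume to (ideal fill ratio - its own fill ratio).

    `volumes` maps a volume name to (used_bytes, capacity_bytes).
    A negative value means the volume is fuller than ideal;
    a positive value means it has room to receive data.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(capacity for _, capacity in volumes.values())
    ideal = total_used / total_capacity
    return {name: ideal - used / capacity for name, (used, capacity) in volumes.items()}


# Hypothetical DataNode with three equally sized disks.
disks = {
    "/data/disk1": (900, 1000),
    "/data/disk2": (300, 1000),
    "/data/disk3": (0, 1000),
}

densities = volume_densities(disks)
most_full = min(densities, key=densities.get)   # best source for moving data
most_empty = max(densities, key=densities.get)  # best destination
print(most_full, most_empty)
```

A balancer working along these lines would move data from the fullest disk toward the emptiest one until every value is close to zero.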

Storage scheme
Hadoop 2.x: Uses a 3x replication scheme.
Hadoop 3.x: Supports erasure coding in HDFS.

Storage overhead
Hadoop 2.x: 3x replication has a 200% storage overhead.
Hadoop 3.x: With erasure coding, the storage overhead is only 50%.

Storage overhead example
Hadoop 2.x: 6 blocks of data occupy 18 blocks of space because of the replication scheme.
Hadoop 3.x: 6 blocks of data occupy 9 blocks of space: 6 data blocks and 3 parity blocks.
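The arithmetic behind these two figures can be checked with a short sketch. It assumes 3-way replication on one side and a layout of 6 data blocks plus 3 parity blocks on the other; the function names are hypothetical, and the treatment of partially filled groups is a simplifying assumption.

```python
def replication_footprint(data_blocks: int, replicas: int = 3) -> tuple[int, float]:
    """Return (blocks stored, overhead in percent) for N-way replication."""
    stored = data_blocks * replicas
    overhead = (stored - data_blocks) / data_blocks * 100
    return stored, overhead


def erasure_coding_footprint(
    data_blocks: int, data_units: int = 6, parity_units: int = 3
) -> tuple[int, float]:
    """Return (blocks stored, overhead in percent) for a data + parity layout."""
    # Assumption: every group of up to `data_units` data blocks
    # carries a full set of `parity_units` parity blocks.
    groups = -(-data_blocks // data_units)  # ceiling division
    stored = data_blocks + groups * parity_units
    overhead = (stored - data_blocks) / data_blocks * 100
    return stored, overhead


print(replication_footprint(6))     # (18, 200.0)
print(erasure_coding_footprint(6))  # (9, 50.0)
```

For 6 data blocks the sketch reproduces the 18 blocks (200% overhead) and 9 blocks (50% overhead) quoted in the comparison.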

YARN timeline service
Hadoop 2.x: Uses an older timeline service that has scalability issues.
Hadoop 3.x: Introduces timeline service v2, which improves the scalability and reliability of the timeline service.

Default port range
Hadoop 2.x: Some default ports fall within the Linux ephemeral port range, so services can fail to bind to them at startup.
Hadoop 3.x: These ports have been moved out of the ephemeral range.
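The port change can be made concrete with a small check. The sketch assumes the common Linux default ephemeral range of 32768 to 60999 (it is configurable through net.ipv4.ip_local_port_range) and uses a few commonly documented default port pairs; both should be verified against the system and the release notes of the versions actually in use.

```python
# Assumed Linux default ephemeral range; the real value on a host
# comes from net.ipv4.ip_local_port_range.
EPHEMERAL_RANGE = range(32768, 61000)

# service -> (Hadoop 2.x default, Hadoop 3.x default); commonly documented
# values, listed here for illustration only.
DEFAULT_PORTS = {
    "NameNode web UI": (50070, 9870),
    "DataNode data transfer": (50010, 9866),
    "DataNode web UI": (50075, 9864),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the OS may hand this port out to an outgoing connection."""
    return port in EPHEMERAL_RANGE


report = {
    service: (in_ephemeral_range(old), in_ephemeral_range(new))
    for service, (old, new) in DEFAULT_PORTS.items()
}

for service, (old_at_risk, new_at_risk) in report.items():
    print(f"{service}: 2.x at risk={old_at_risk}, 3.x at risk={new_at_risk}")
```

A port inside the ephemeral range may already be taken by an unrelated client connection when the daemon starts, which is what causes the bind failure.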

Tools
Hadoop 2.x: Hive, Pig, Giraph, and other Hadoop tools are available.
Hadoop 3.x: Hive, Pig, Tez, Hama, Giraph, and other Hadoop tools are available.

Compatible file systems
Hadoop 2.x: HDFS (the default file system); the FTP file system, which stores all its data on remotely accessible FTP servers; the Amazon S3 (Simple Storage Service) file system; and the Windows Azure Storage Blobs (WASB) file system.
Hadoop 3.x: Supports all of the above as well as the Microsoft Azure Data Lake file system.

DataNode resources
Hadoop 2.x: DataNode resources are not dedicated to MapReduce; they can be used by other applications.
Hadoop 3.x: DataNode resources can likewise be used by other applications.

MR API compatibility
Hadoop 2.x: The MR API is compatible, so Hadoop 1.x programs can run on Hadoop 2.x.
Hadoop 3.x: The MR API is likewise compatible, so Hadoop 1.x programs can run on Hadoop 3.x.

Support for Microsoft Windows
Hadoop 2.x: Can be deployed on Windows.
Hadoop 3.x: Also supports Microsoft Windows.

Slots / containers
Hadoop 2.x: Hadoop 1 works on the concept of slots, whereas Hadoop 2.x works on the concept of containers. A container can run generic tasks.
Hadoop 3.x: Also works on the concept of containers.

Single point of failure
Hadoop 2.x: Has features to overcome the SPOF, so whenever the NameNode fails it recovers automatically.
Hadoop 3.x: Has features to overcome the SPOF, so whenever the NameNode fails it recovers automatically, with no manual intervention needed.


HDFS Federation
Hadoop 2.x: Hadoop 1.0 has only a single NameNode to manage the entire namespace, whereas Hadoop 2.0 supports multiple NameNodes for multiple namespaces.
Hadoop 3.x: Also supports multiple NameNodes for multiple namespaces.
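With several NameNodes each owning its own namespace, something has to decide which NameNode serves a given path. One way to picture this is a client-side mount table that maps path prefixes to NameNodes. The sketch below assumes such a table; the host names, mount points, and function name are hypothetical and do not reflect any real cluster or Hadoop API.

```python
# Hypothetical mount table: path prefix -> NameNode that owns that namespace.
MOUNT_TABLE = {
    "/user": "hdfs://nn-users.example.com:8020",
    "/data": "hdfs://nn-data.example.com:8020",
    "/data/logs": "hdfs://nn-logs.example.com:8020",
}


def resolve_namenode(path: str, mount_table: dict[str, str]) -> str:
    """Return the NameNode URI for `path`, using the longest matching mount point."""
    matches = [
        mount
        for mount in mount_table
        # Match whole path components only, so "/database" does not match "/data".
        if path == mount or path.startswith(mount.rstrip("/") + "/")
    ]
    if not matches:
        raise LookupError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


print(resolve_namenode("/data/logs/app.log", MOUNT_TABLE))
print(resolve_namenode("/data/sales/q1.csv", MOUNT_TABLE))
```

Because each namespace is served by a different NameNode, metadata load is spread across them instead of concentrating on a single NameNode.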

Scalability
Hadoop 2.x: Can scale up to 10,000 nodes per cluster.
Hadoop 3.x: Better scalability; can scale to more than 10,000 nodes per cluster.

Faster access to data
Hadoop 2.x: DataNode caching gives fast access to data.
Hadoop 3.x: DataNode caching likewise gives fast access to data.

HDFS snapshot
Hadoop 2.x: Adds support for snapshots, which provide disaster recovery and protection against user error.
Hadoop 3.x: Also supports the snapshot feature.

Platform
Hadoop 2.x: Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming, and real-time operations.
Hadoop 3.x: It is likewise possible to run event processing, streaming, and real-time operations on top of YARN.

Cluster resource management
Hadoop 2.x: Uses YARN, which improves scalability, high availability, and multi-tenancy.
Hadoop 3.x: Uses YARN, with all of these features.