HDFS presented by VIJAY

Hadoop Distributed File System (HDFS)

SEMINAR GUIDE
Mr. PRAMOD PAVITHRAN
HEAD OF DIVISION
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING, CUSAT

PRESENTED BY
VIJAY PRATAP SINGH
REG NO: 12110083
S7, CS-B
ROLL NO: 81

CONTENTS

WHAT IS HADOOP

PROJECT COMPONENTS IN HADOOP

MAP/REDUCE

HDFS

ARCHITECTURE

WRITE & READ IN HDFS

GOALS OF HADOOP

COMPARISON WITH OTHER SYSTEMS

CONCLUSION

REFERENCES

WHAT IS HADOOP ?

Hadoop is an open-source software framework.

The Hadoop framework consists of two main layers:

Distributed file system (HDFS)

Execution engine (MapReduce)

Supports data-intensive distributed applications.

Licensed under the Apache v2 license.

It enables applications to work with thousands of computation-independent computers and petabytes of data

WHY HADOOP ?

PROJECT COMPONENTS IN HADOOP

MAP/REDUCE

Hadoop is a popular open-source implementation of MapReduce.

MapReduce is a programming model for processing large data sets.

MapReduce is typically used for distributed computing on clusters of computers.

MapReduce can take advantage of data locality, processing data on or near the storage nodes to reduce data transmission.

The model is inspired by the map and reduce functions.

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
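
The map and reduce steps above can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the loop over chunks stands in for work that a cluster would spread across nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate pairs by key so each key is reduced exactly once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map over every chunk, then reduce the grouped results."""
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        intermediate.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node data"]))
```

The input is divided into chunks (the sub-problems), each chunk is mapped independently, and the reduce step combines the partial answers into the final counts.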

MAP REDUCE ENGINE

HDFS

Highly scalable file system

6K nodes and 120PB

Add commodity servers and disks to scale storage and IO bandwidth

Supports parallel reading & processing of data

Optimized for streaming reads/writes of large files

Bandwidth scales linearly with the number of nodes and disks

Fault tolerant & easy management

Built-in redundancy

Tolerate disk and node failure

Automatically manages addition/removal of nodes

One operator per 3K nodes

Scalable, Reliable & Manageable
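
The scale figures on this slide can be turned into per-node numbers with a little arithmetic. The sketch below assumes decimal units (1 PB = 1000 TB) and a replication factor of 3; the slide does not say whether 120 PB is raw or usable capacity, so the "logical" figure only applies if it is raw. The function name cluster_arithmetic is hypothetical.

```python
def cluster_arithmetic(nodes, total_pb, replication=3, nodes_per_operator=3000):
    """Derive per-node storage, logical capacity, and operator count.

    replication=3 is an assumption; the slide does not state it.
    """
    return {
        # Average storage attached to one node, in terabytes.
        "per_node_tb": total_pb * 1000 / nodes,
        # Data that fits if total_pb is raw capacity and every block is
        # stored `replication` times.
        "logical_pb": total_pb / replication,
        # Operators needed at one operator per `nodes_per_operator` nodes.
        "operators": nodes / nodes_per_operator,
    }


if __name__ == "__main__":
    # 6K nodes and 120 PB, one operator per 3K nodes (figures from the slide).
    print(cluster_arithmetic(nodes=6000, total_pb=120))
```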

LIMITATIONS OF EXISTING DATA ANALYTICS ARCHITECTURE

BIG DATA

INCREASING BIG DATA

HADOOP'S APPROACH

HADOOP'S APPROACH

HADOOP'S APPROACH

ARCHITECTURE OF HADOOP

HADOOP MASTER/SLAVE ARCHITECTURE

ARCHITECTURE OF HDFS

ARCHITECTURE OF HDFS

CLIENT INTERACTION TO HADOOP

HDFS WRITE

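[Diagram: the client and the NameNode connect through a core switch and two rack switches to DataNode 1, DataNode 7, and DataNode 9; the racks are labelled Rack 1 and Rack 5. file.txt consists of blocks A, B, and C.]

Rack awareness (NameNode): Rack 1: DN 1; Rack 2: DN 7, 9

Client to NameNode: I want to write file.txt, Block A

NameNode to Client: OK, write to DataNodes [1, 7, 9]

Client to DataNode 1: Ready? (DN 7, 9)

DataNode 1 to DataNode 7: Ready? (DN 9)

DataNode 7 to DataNode 9: Ready?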
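
The NameNode's answer [1, 7, 9] puts one replica on one rack and two replicas on another. The sketch below imitates that choice under simplifying assumptions: the rack map is a plain dictionary, the first DataNode is given by the caller, and the first other rack with two nodes is picked deterministically. The function name choose_replica_targets is hypothetical and this is not the actual HDFS placement code.

```python
def choose_replica_targets(rack_map, first_node):
    """Return three DataNodes: first_node plus two nodes from one other rack."""
    # Find the rack that holds the first replica.
    local_rack = next(
        (rack for rack, nodes in rack_map.items() if first_node in nodes), None
    )
    if local_rack is None:
        raise ValueError(f"unknown DataNode: {first_node}")

    # Place the remaining two replicas together on a different rack, so the
    # block survives the loss of either rack.
    for rack, nodes in rack_map.items():
        if rack != local_rack and len(nodes) >= 2:
            return [first_node, nodes[0], nodes[1]]
    raise ValueError("no other rack with two DataNodes available")


if __name__ == "__main__":
    # Rack layout taken from the rack awareness table on the slide.
    racks = {"rack1": ["DN1"], "rack2": ["DN7", "DN9"]}
    print(choose_replica_targets(racks, "DN1"))
```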

PIPELINED WRITE

[Diagram: same client, NameNode, switches, and DataNodes 1, 7, and 9 as on the HDFS WRITE slide. Block A is written through the pipeline until DataNode 1, DataNode 7, and DataNode 9 each hold a copy of A.]

Rack awareness (NameNode): Rack 1: DN 1; Rack 2: DN 7, 9

DataNodes to NameNode: Block Received

Client: Success

NameNode metadata: file.txt = Block A: DN 1, 7, 9
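
The pipelined write can be sketched as a chain of nodes that each store the block and forward it to the next one. This is a toy model under stated assumptions: the DataNode class and pipelined_write function are hypothetical, the whole block is passed in one call rather than streamed, and the NameNode is reduced to a metadata dictionary.

```python
class DataNode:
    """Toy DataNode that keeps received blocks in memory."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, and return the acks."""
        self.blocks[block_id] = data
        acks = [self.name]
        if downstream:
            acks += downstream[0].receive(block_id, data, downstream[1:])
        return acks


def pipelined_write(filename, block_id, data, pipeline, metadata):
    """Send one block to the first node only; each node forwards to the next."""
    acks = pipeline[0].receive(block_id, data, pipeline[1:])
    success = acks == [dn.name for dn in pipeline]
    if success:
        # Record where the block now lives, as the NameNode metadata does.
        metadata.setdefault(filename, {})[block_id] = acks
    return success


if __name__ == "__main__":
    nodes = [DataNode("DN1"), DataNode("DN7"), DataNode("DN9")]
    metadata = {}
    print(pipelined_write("file.txt", "A", b"block A bytes", nodes, metadata))
    print(metadata)
```

The client talks only to the first DataNode, so its outgoing bandwidth is used once per block regardless of the replication factor.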

HDFS READ

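[Diagram: the client and the NameNode connect through a core switch and two rack switches to DataNode 1, DataNode 7, and DataNode 9; the racks are labelled Rack 1 and Rack 5. Each of the three DataNodes holds a copy of block A.]

Client to NameNode: I want to read file.txt, Block A

NameNode to Client: Available at DataNodes [1, 7, 9]

NameNode metadata: file.txt = Block A: DN 1, 7, 9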
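
The read path amounts to a metadata lookup followed by a fetch from one of the listed DataNodes. The sketch below assumes plain dictionaries for the NameNode metadata and for the DataNodes' block stores, and models an unreachable DataNode as None; the function name read_block is hypothetical.

```python
def read_block(metadata, datanodes, filename, block_id):
    """Ask the metadata where a block lives, then read the first reachable copy."""
    locations = metadata[filename][block_id]  # the NameNode's answer
    for name in locations:
        store = datanodes.get(name)  # None models a failed or unreachable node
        if store is not None and block_id in store:
            return name, store[block_id]
    raise IOError(f"no live replica of block {block_id} of {filename}")


if __name__ == "__main__":
    metadata = {"file.txt": {"A": ["DN1", "DN7", "DN9"]}}
    datanodes = {
        "DN1": None,  # simulate a failed node
        "DN7": {"A": b"block A bytes"},
        "DN9": {"A": b"block A bytes"},
    }
    print(read_block(metadata, datanodes, "file.txt", "A"))
```

Because the block exists on three DataNodes, the read still succeeds when the first location is down.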

HDFS SHELL COMMANDS

bin/hadoop fs -ls

bin/hadoop fs -mkdir

bin/hadoop fs -copyFromLocal

bin/hadoop fs -copyToLocal

bin/hadoop fs -moveToLocal

bin/hadoop fs -rm

bin/hadoop fs -tail

bin/hadoop fs -chmod

bin/hadoop fs -setrep -w 4 -R /dir1/s-dir/
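
Most of the commands above take path arguments that the slide omits. The sketch below builds a few complete command lines without running them; the paths (/user/demo, report.txt) and the helper name hadoop_fs are made up for illustration, and the bin/hadoop location is taken from the slide.

```python
import shlex


def hadoop_fs(*args, hadoop_bin="bin/hadoop"):
    """Build the argument list for one `hadoop fs` call (nothing is executed)."""
    return [hadoop_bin, "fs", *map(str, args)]


# Hypothetical paths, chosen only to show where the arguments go.
commands = [
    hadoop_fs("-mkdir", "/user/demo"),
    hadoop_fs("-copyFromLocal", "report.txt", "/user/demo/report.txt"),
    hadoop_fs("-ls", "/user/demo"),
    hadoop_fs("-setrep", "-w", 4, "/user/demo/report.txt"),
]

if __name__ == "__main__":
    for argv in commands:
        print(shlex.join(argv))
```

On a machine with Hadoop installed, each list could be passed to subprocess.run(argv, check=True) to execute the command.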

GOALS OF HDFS

Very Large Distributed File System

10K nodes, 100 million files, 10PB

Assumes Commodity Hardware

Files are replicated to handle hardware failure

Detect failures and recover from them

Optimized for Batch Processing

Data locations exposed so that computations can move to where data resides

Provides very high aggregate bandwidth
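
Exposing data locations lets a scheduler run each task on a node that already stores the block it needs. The sketch below is a simplified illustration, not Hadoop's scheduler: the function name schedule_tasks is hypothetical, and it simply picks the least-loaded node among those holding a replica.

```python
def schedule_tasks(block_locations):
    """Assign one task per block to the least-loaded node holding a replica."""
    load = {}
    assignment = {}
    for block, nodes in block_locations.items():
        # Only nodes that already store the block are candidates, so no block
        # data has to cross the network before the task starts.
        target = min(nodes, key=lambda n: load.get(n, 0))
        assignment[block] = target
        load[target] = load.get(target, 0) + 1
    return assignment


if __name__ == "__main__":
    locations = {
        "A": ["DN1", "DN7", "DN9"],
        "B": ["DN1", "DN7", "DN9"],
        "C": ["DN7", "DN9", "DN1"],
    }
    print(schedule_tasks(locations))
```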

SCALABILITY OF HADOOP

EASE TO PROGRAMMERS

HADOOP VS. OTHER SYSTEMS

HADOOP USERS

TO LEARN MORE

Source code

http://hadoop.apache.org/version_control.html

http://svn.apache.org/viewvc/hadoop/common/trunk/

Hadoop releases

http://hadoop.apache.org/releases.html

Contribute to it

http://wiki.apache.org/hadoop/HowToContribute

CONCLUSION

HDFS provides a reliable, scalable, and manageable solution for working with huge amounts of data.

Future secure

HDFS has been deployed in clusters of 10 to 4,000 DataNodes.

Used in production at companies such as Yahoo!, Facebook, Twitter, and eBay.

Many enterprises, including financial companies, use Hadoop.

REFERENCES

[1] M. Zukowski, S. Heman, N. Nes, and P. Boncz, Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS, in VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, Pages 2334, 2007.

[2] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, Third Edition, May 2012.

[3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, The Hadoop Distributed Filesystem: Balancing Portability and Performance, Rice University, Houston, TX.

[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, The Hadoop Distributed File System, Yahoo!, Sunnyvale, California, USA.

[5] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz, Efficient Big Data Processing in Hadoop MapReduce, Information Systems Group, Saarland University.

Thank you.

Queries

19/08/13
