Hadoop Distributed File System (HDFS)
SEMINAR GUIDE
Mr. PRAMOD PAVITHRAN
HEAD OF DIVISION
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING, CUSAT
PRESENTED BY
VIJAY PRATAP SINGH
REG NO: 12110083
S7, CS-B
ROLL NO: 81
CONTENTS
WHAT IS HADOOP
PROJECT COMPONENTS IN HADOOP
MAP/REDUCE
HDFS
ARCHITECTURE
WRITE & READ IN HDFS
GOALS OF HADOOP
COMPARISON WITH OTHER SYSTEMS
CONCLUSION
REFERENCES
WHAT IS HADOOP ?
Hadoop is an open-source software framework.
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
Supports data-intensive distributed applications.
Licensed under the Apache v2 license.
It enables applications to work with thousands of computation-independent computers and petabytes of data.
WHY HADOOP ?
PROJECT COMPONENTS IN HADOOP
MAP/REDUCE
Hadoop is a popular open-source implementation of MapReduce.
MapReduce is a programming model for processing large data sets.
MapReduce is typically used to do distributed computing on clusters of computers.
MapReduce can take advantage of data locality, processing data on or near the storage nodes to reduce data transmission.
The model is inspired by the map and reduce functions commonly used in functional programming.
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its smaller problem and passes the answer back to the master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
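The two steps above can be illustrated with a minimal single-process sketch of a word count. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the "chunks" stand in for the sub-problems that a real cluster would hand to separate nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group intermediate values by key between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into one result."""
    return key, sum(values)


def word_count(chunks):
    """Run map on every chunk, group by word, then reduce each group."""
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        intermediate.extend(map_phase(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The grouping step in the middle is what lets each reduce call see every value for its key, regardless of which chunk produced it.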
MAP REDUCE ENGINE
HDFS
Highly scalable file system
6K nodes and 120PB
Add commodity servers and disks to scale storage and IO bandwidth
Supports parallel reading & processing of data
Optimized for streaming reads/writes of large files
Bandwidth scales linearly with the number of nodes and disks
Fault tolerant & easy management
Built-in redundancy
Tolerate disk and node failure
Automatically manages addition/removal of nodes
One operator per 3K nodes
Scalable, Reliable & Manageable
LIMITATIONS OF EXISTING DATA ANALYTICS ARCHITECTURE
BIG DATA
INCREASING BIG DATA
HADOOP'S APPROACH
ARCHITECTURE OF HADOOP
HADOOP MASTER/SLAVE ARCHITECTURE
ARCHITECTURE OF HDFS
CLIENT INTERACTION TO HADOOP
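HDFS WRITE
[Figure: a client, the Name Node, and DataNodes 1, 7 and 9 in two racks (labelled Rack 1 and Rack 5), each behind a rack switch joined by a core switch. The client's file carries block labels A, B and C. Rack-awareness table at the Name Node: Rack 1: DN 1; Rack 2: DN 7, 9.]
Client to Name Node: I want to write file.txt, Block A
Name Node to client: OK, write to DataNodes [1, 7, 9]
Messages along the pipeline: Ready DN 7, 9; Ready DN 9; Ready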
PIPELINED WRITE
[Figure: same cluster layout as the HDFS write figure. Three copies of Block A are shown in flight through DataNodes 1, 7 and 9.]
[Figure, next step: a copy of Block A is stored on each of DataNodes 1, 7 and 9. Messages shown: "Block Received", "Success".]
Name Node metadata: File.txt = Block A: DN 1, 7, 9
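The write path in the slides above (the client asks the Name Node, is told to write to DataNodes [1, 7, 9], the block flows down the pipeline, and the Name Node records the locations) can be sketched as a small simulation. Everything here is an assumption made for illustration: the classes, method names and rack labels are hypothetical and are not HDFS's real interfaces, and the placement rule only mirrors the layout on the slide (one replica on the first node, two on a single other rack).

```python
class DataNode:
    def __init__(self, node_id, rack):
        self.node_id = node_id
        self.rack = rack
        self.blocks = {}  # block id -> block data held on this node

    def receive(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, return the ids that stored it."""
        self.blocks[block_id] = data
        acks = [self.node_id]
        if downstream:
            acks += downstream[0].receive(block_id, data, downstream[1:])
        return acks


class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}  # file name -> {block id -> [datanode ids]}

    def choose_targets(self):
        """Simplified placement: first node, then two nodes sharing another rack."""
        first = self.datanodes[0]
        second = next(dn for dn in self.datanodes if dn.rack != first.rack)
        third = next(dn for dn in self.datanodes
                     if dn.rack == second.rack and dn is not second)
        return [first, second, third]

    def block_received(self, filename, block_id, node_ids):
        """Record which DataNodes hold the block once the write has succeeded."""
        self.metadata.setdefault(filename, {})[block_id] = node_ids


def write_block(namenode, filename, block_id, data):
    """Client side: get targets, push the block to the first one, report success."""
    targets = namenode.choose_targets()
    stored_on = targets[0].receive(block_id, data, targets[1:])
    namenode.block_received(filename, block_id, stored_on)
    return stored_on


if __name__ == "__main__":
    nodes = [DataNode(1, "rack1"), DataNode(7, "rack5"), DataNode(9, "rack5")]
    namenode = NameNode(nodes)
    print(write_block(namenode, "File.txt", "A", b"block A bytes"))
    print(namenode.metadata)
```

The client sends the data only once, to the first DataNode; each node forwards it to the next, which is why the slides call the write "pipelined".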
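HDFS READ
[Figure: a client, the Name Node, and DataNodes 1, 7 and 9 in two racks (labelled Rack 1 and Rack 5) behind rack switches and a core switch, with a copy of Block A on each DataNode.]
Client to Name Node: I want to read file.txt, Block A
Name Node to client: Available at DataNodes [1, 7, 9]
Name Node metadata: File.txt = Block A: DN 1, 7, 9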
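The read exchange above can be sketched in the same spirit: the client asks for the block's locations, then fetches the block from one of the listed DataNodes. The dictionaries, the `read_block` function, the rack labels and the "prefer a replica on the client's rack" rule are all assumptions made for this illustration, not HDFS's actual client code.

```python
# Hypothetical in-memory stand-ins for the Name Node metadata and the cluster.
BLOCK_LOCATIONS = {"File.txt": {"A": [1, 7, 9]}}
NODE_RACKS = {1: "rack1", 7: "rack5", 9: "rack5"}
NODE_STORAGE = {node_id: {"A": b"block A bytes"} for node_id in NODE_RACKS}


def get_block_locations(filename, block_id):
    """Name Node side: answer 'where is this block?' with a list of DataNode ids."""
    return list(BLOCK_LOCATIONS[filename][block_id])


def read_block(filename, block_id, client_rack, dead_nodes=frozenset()):
    """Client side: try replicas on the client's rack first, skipping failed nodes."""
    locations = get_block_locations(filename, block_id)
    # sorted() is stable, so ties keep the order returned by the Name Node.
    ordered = sorted(locations, key=lambda node_id: NODE_RACKS[node_id] != client_rack)
    for node_id in ordered:
        if node_id in dead_nodes:
            continue  # this replica is unavailable, fall back to the next one
        return node_id, NODE_STORAGE[node_id][block_id]
    raise IOError(f"no live replica of block {block_id} of {filename}")


if __name__ == "__main__":
    print(read_block("File.txt", "A", client_rack="rack5"))
    print(read_block("File.txt", "A", client_rack="rack1", dead_nodes={1}))
```

Because every replica holds the full block, losing one DataNode only changes which node serves the read.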
HDFS SHELL COMMANDS
bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir/
GOALS OF HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10PB
Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
Optimized for Batch Processing
Data locations exposed so that computations can move to where data resides
Provides very high aggregate bandwidth
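To get a feel for the scale target quoted above (10K nodes, 100 million files, 10PB), the averages can be worked out directly. The helper names are hypothetical, decimal units (1 PB = 10^15 bytes) are assumed, and the replication factor of 3 is an assumption taken from the three-DataNode example in the write slides rather than a figure stated with the goals.

```python
PB = 10**15  # decimal petabyte, an assumption about the units on the slide


def bytes_per_node(total_bytes, nodes):
    """Average amount of data each node would hold."""
    return total_bytes // nodes


def bytes_per_file(total_bytes, files):
    """Average file size implied by the totals."""
    return total_bytes // files


def raw_bytes_needed(logical_bytes, replication):
    """Disk space consumed when every block is stored `replication` times."""
    return logical_bytes * replication


if __name__ == "__main__":
    total = 10 * PB
    print(bytes_per_node(total, 10_000))        # average bytes per node
    print(bytes_per_file(total, 100_000_000))   # average bytes per file
    print(raw_bytes_needed(total, 3) // PB)     # raw PB if 10 PB is logical data
```

Under these assumptions the target works out to about 1 TB per node and about 100 MB per file, which fits the emphasis on large files and commodity servers.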
SCALABILITY OF HADOOP
EASE TO PROGRAMMERS
HADOOP VS. OTHER SYSTEMS
HADOOP USERS
TO LEARN MORE
Source code
http://hadoop.apache.org/version_control.html
http://svn.apache.org/viewvc/hadoop/common/trunk/
Hadoop releases
http://hadoop.apache.org/releases.html
Contribute to it
http://wiki.apache.org/hadoop/HowToContribute
CONCLUSION
HDFS provides a reliable, scalable and manageable solution for working with huge amounts of data
Future secure
HDFS has been deployed in clusters of 10 to 4K DataNodes
Used in production at companies such as Yahoo!, Facebook, Twitter and eBay
Many enterprises, including financial companies, use Hadoop
REFERENCES
[1] M. Zukowski, S. Heman, N. Nes, and P. Boncz, Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS, in VLDB 07: Proceedings of the 33rd International Conference on Very Large Data Bases, Pages 2334, 2007.
[2] Tom White, Hadoop: The Definitive Guide, O'Reilly Media, Third Edition, May 2012.
[3] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, The Hadoop Distributed Filesystem: Balancing Portability and Performance, Rice University, Houston, TX.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, The Hadoop Distributed File System, Yahoo!, Sunnyvale, California, USA.
[5] Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz, Efficient Big Data Processing in Hadoop MapReduce, Information Systems Group, Saarland University.
Thank you.
Queries
19/08/13