HDFS Internals


Bhupesh Chawda

bhupesh@apache.org

DataTorrent

Image Source: https://help.marklogic.com/news/list/Index/10

Agenda

What are Blocks?

A physical storage disk has a block size: the minimum amount of data it can read or write, normally 512 bytes.

File systems for a single disk also deal with data in blocks, normally a few kilobytes (e.g. 4 KB).

HDFS has a much larger block size: 64 MB by default.

Files in HDFS are broken into block-sized chunks, each stored as an independent unit.

However, a file smaller than a single block does not occupy a full block's worth of underlying storage.
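These points can be checked from the HDFS Java API. Below is a minimal, untested sketch (the path /user/demo/sample.txt is made up) that prints a file's length next to its configured block size: a 1 KB file still reports a 64 MB block size, but it only consumes about 1 KB of datanode storage.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            System.out.println("Length (bytes):     " + status.getLen());
            System.out.println("Block size (bytes): " + status.getBlockSize());
        }
    }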

Should I care?

Why such large blocks?

Minimize disk seek times.

Assuming a 10 ms seek time and a 100 MB/s disk transfer rate, a 100 MB block takes about 1 s to transfer, so the seek is only about 1% of the transfer time: small enough to ignore.

Hence the default is 64 MB, and many production environments use 128 MB.
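A quick back-of-the-envelope check of the numbers above (plain arithmetic, no Hadoop required): with a fixed 10 ms seek and a 100 MB/s transfer rate, larger blocks shrink the seek cost to a negligible fraction of total read time.

    public class SeekOverhead {
        public static void main(String[] args) {
            double seekMs = 10.0;      // assumed seek time per block
            double rateMBps = 100.0;   // assumed sustained disk transfer rate
            // 4 KB, 64 MB, 100 MB and 128 MB blocks, expressed in MB
            double[] blockMB = {0.004, 64, 100, 128};

            for (double mb : blockMB) {
                double transferMs = mb / rateMBps * 1000.0;
                double seekShare = 100.0 * seekMs / (seekMs + transferMs);
                System.out.printf("%8.3f MB block -> seek is %5.1f%% of total time%n",
                        mb, seekShare);
            }
        }
    }

For a 4 KB block the seek dominates (around 99% of the time); for a 100 MB block it drops to about 1%, matching the figure quoted above.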

HDFS Architecture

Image Source: https://hadoop.apache.org

Namenode and Datanode

Master - Namenode

Manages file system namespace

File system tree and metadata for all files and directories

Stores this info in:

Namespace image

Edit log

Knows, for a given file, which datanodes hold the corresponding blocks. This block map is not persisted; it is reconstructed from datanode block reports at startup.

Worker - Datanode

Store and retrieve blocks as requested by clients

Periodically report back to the namenode on the list of blocks they are storing
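To see the namenode's block-to-datanode mapping from a client, the FileSystem API exposes getFileBlockLocations. A minimal, untested sketch (the path /user/demo/big.log is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log"));

            // One BlockLocation per block; each lists the datanodes holding a replica
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }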

HDFS Storage

Image Source: https://developer.yahoo.com/hadoop/tutorial/module2.html

Secondary Namenode

Image Source: http://www.quickmeme.com/meme/35ke38

Secondary Namenode

Not a backup namenode

Periodically merges the edit log into the namespace image, so the edit log does not grow too large

Usually runs on a different machine from the namenode, since the merge needs as much memory as the namenode itself

The secondary's state always lags behind the primary's, however, so the merged copy alone cannot restore the latest state if the primary fails

In the event of a primary failure, copy the primary's namespace image to the secondary and run the secondary as the new primary.
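How often the secondary checkpoints is controlled by configuration. A small, untested sketch for reading the setting; note the key name changed across Hadoop versions (fs.checkpoint.period in 1.x, dfs.namenode.checkpoint.period in 2.x and later), and 3600 seconds is the usual default.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
        public static void main(String[] args) {
            // Loads hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            String period = conf.get("dfs.namenode.checkpoint.period",    // Hadoop 2.x+ key
                    conf.get("fs.checkpoint.period", "3600"));            // Hadoop 1.x key, then default
            System.out.println("Checkpoint period (seconds): " + period);
        }
    }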

Writing a File in HDFS
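From the client's point of view, a write is just FileSystem.create() followed by writes to the returned stream; the namenode allocates blocks and the datanodes replicate them in a pipeline behind the scenes. A minimal, untested sketch (path and contents are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create() contacts the namenode; the data itself streams to datanodes
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"), true)) {
                out.writeUTF("Hello HDFS");
            }
        }
    }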

Reading a file in HDFS
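Reads are the mirror image: FileSystem.open() fetches block locations from the namenode, and the client then reads each block directly from a nearby datanode. A minimal, untested sketch (the path is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
                // Stream the file contents to stdout; 4096 is just a buffer size
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }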

HDFS Block Placement

Small File Problem?

Each file, directory, and block occupies an entry in the namenode's in-memory namespace, irrespective of file size! Many small files can exhaust namenode memory long before disk space runs out.
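A rough illustration of the cost (the ~150 bytes per namespace object figure is a commonly quoted rule of thumb, not an exact number): 1 GB stored as 64 KB files needs orders of magnitude more namenode memory than the same 1 GB stored as 64 MB blocks.

    public class SmallFileMath {
        public static void main(String[] args) {
            long dataBytes = 1L << 30;        // 1 GB of data
            long bytesPerObject = 150;        // rule-of-thumb namenode memory per file/block object

            long smallFiles = dataBytes / (64 * 1024);    // 16,384 files of 64 KB, 1 block each
            long bigFileBlocks = dataBytes / (64L << 20); // 16 blocks of 64 MB in one file

            // Each small file costs a file object plus a block object
            System.out.println("16,384 small files: ~"
                    + smallFiles * 2 * bytesPerObject / 1024 + " KB of namenode memory");
            System.out.println("One 1 GB file:      ~"
                    + (1 + bigFileBlocks) * bytesPerObject + " bytes of namenode memory");
        }
    }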

Further Reading

HDFS Comics :-) https://docs.google.com/open?id=0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1


Thank You!!

Please send your questions to: bhupesh@apache.org / bhupesh@datatorrent.com