Wednesday, December 11, 2013

The Hadoop Distributed Filesystem

Design of HDFS

HDFS is a filesystem designed for

  • Very large files - Files that are of hundereds of MB, GB or TB. Hadoop clusters running today stores petabytes of data.
  • Streaming data access - write once, read many times pattern
  • Commodity hardware - Hadoop doesn’t require expensive, highly reliable hardware to run on.

The applications for which using HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today

  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications


HDFS has the concept of a block, but it is a much larger unit—64 MB by default. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

HDFS’s fsck command understands blocks. For example
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem.

Namenodes and Datanodes

An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located, however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.

The file system cannot be used without namenode. If namenode goes down, there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. 

There are 2 ways to recover

  • Run a secondary name node
  • Backup the files that make up the persistant state of the filesystem matadata and ensure that the writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.

No comments:

Post a Comment