
Compression in Hadoop

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to consider carefully how to use compression in Hadoop.

Some of the compression formats used in Hadoop:

Compression Format   Tool    Algorithm   Filename Extension   Splittable
DEFLATE              NA      DEFLATE     .deflate             No
gzip                 gzip    DEFLATE     .gz                  No
bzip2                bzip2   bzip2       .bz2                 Yes
LZO                  lzop    LZO         .lzo                 No
Snappy               NA      Snappy      .snappy              No

Codecs

A codec is the implementation of a compression-decompression algorithm, and in Hadoop it is represented by an implementation of the CompressionCodec interface.

Compression Format   Hadoop CompressionCodec
DEFLATE              org.apac
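
As a rough sketch of how a codec is used (not code from the original post), the following Java snippet compresses data read from standard input and writes it to standard output; the choice of GzipCodec, the class name, and the 4096-byte buffer size are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Compresses stdin to stdout using a Hadoop CompressionCodec (gzip here, as an example).
public class StreamCompressor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Instantiate the codec; any CompressionCodec implementation could be plugged in instead.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    // Wrap stdout in a compressing stream obtained from the codec.
    CompressionOutputStream out = codec.createOutputStream(System.out);
    // Copy raw bytes from stdin into the compressing stream (4096-byte buffer, don't close the streams).
    IOUtils.copyBytes(System.in, out, 4096, false);
    // Flush the remaining compressed data without closing stdout.
    out.finish();
  }
}

With the Hadoop client libraries on the classpath, a pipeline such as echo "some text" | hadoop StreamCompressor | gunzip should round-trip the input, though the class name here is only a placeholder.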

The Hadoop Distributed Filesystem

Design of HDFS

HDFS is a filesystem designed for:

Very large files - files that are hundreds of megabytes, gigabytes, or terabytes in size. Hadoop clusters running today store petabytes of data.
Streaming data access - a write-once, read-many-times pattern.
Commodity hardware - Hadoop doesn't require expensive, highly reliable hardware to run on.

There are also applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications

Blocks

HDFS has the concept of a block, but it is a much larger unit: 64 MB by default. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. For example, with a seek time of around 10 ms and a transfer rate of 100 MB/s, a block of about 100 MB keeps the seek overhead to roughly 1% of the transfer time. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
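
As a small, illustrative sketch (not from the original post), the Java snippet below asks HDFS for a file's block size and block locations through the FileSystem API; the hdfs://namenode/ URI and the /user/hadoop/sample.txt path are placeholders.

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the block size of an HDFS file and where its blocks are stored.
public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode URI; in practice this usually comes from the cluster configuration.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);

    Path file = new Path("/user/hadoop/sample.txt"); // placeholder path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize() + " bytes");

    // Each BlockLocation describes one block of the file and the datanodes holding its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}

The 64 MB default can be changed with the dfs.block.size property (dfs.blocksize in later Hadoop releases); many clusters raise it to 128 MB.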