Monday, December 30, 2013

Compression in Hadoop


File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network and to or from disk. With large volumes of data, both savings can be significant, so it pays to consider carefully how compression is used in Hadoop.
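To get a rough feel for the space savings, here is a minimal sketch that gzips some repetitive, log-like text in memory and prints both sizes. It uses the JDK's own java.util.zip streams rather than Hadoop's classes, and the class name GzipSizeDemo, its helper methods and the sample log line are made up for illustration. Real data will compress better or worse depending on how repetitive it is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Hadoop.
public class GzipSizeDemo {

    // Compress a byte array into the gzip format (DEFLATE data plus a gzip header and trailer).
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    // Decompress gzip data back into the original bytes.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Build some highly repetitive sample input (an invented log line).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("2013-12-30 INFO request served in 12 ms\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);

        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

The same trade-off applies inside Hadoop: fewer bytes on disk and on the wire, paid for with CPU time spent compressing and decompressing.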

Some of the compression formats used in Hadoop

Compression Format   Tool    Algorithm   Filename Extension   Splittable
DEFLATE              N/A     DEFLATE     .deflate             No
gzip                 gzip    DEFLATE     .gz                  No
bzip2                bzip2   bzip2       .bz2                 Yes
LZO                  lzop    LZO         .lzo                 No*
Snappy               N/A     Snappy      .snappy              No

* LZO files become splittable once they have been indexed in a preprocessing step.

Codecs

A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface; GzipCodec, for example, encapsulates the compression and decompression algorithm for gzip.


Compression Format   Hadoop CompressionCodec
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzopCodec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
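
As a rough illustration of how one of the codec classes listed above might be used, the sketch below instantiates a codec by class name and round-trips a byte array through it. It assumes hadoop-common is on the classpath; the class name CodecRoundTrip and its helper methods are hypothetical and exist only for this example. BZip2Codec is used here because it ships with Hadoop itself; the LZO codec lives in a separate library, and the Snappy codec may need native libraries depending on the Hadoop version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical demo class; assumes hadoop-common is available.
public class CodecRoundTrip {

    // Instantiate a codec from its fully qualified class name.
    static CompressionCodec newCodec(String className, Configuration conf) {
        try {
            Class<?> codecClass = Class.forName(className);
            return (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Codec class not found: " + className, e);
        }
    }

    // Compress a byte array in memory with the given codec.
    static byte[] compress(CompressionCodec codec, byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            CompressionOutputStream out = codec.createOutputStream(buffer);
            out.write(raw);
            out.finish(); // flush any remaining compressed data
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress a byte array in memory with the given codec.
    static byte[] decompress(CompressionCodec codec, byte[] packed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(packed))) {
            IOUtils.copyBytes(in, buffer, 4096, false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodec codec = newCodec("org.apache.hadoop.io.compress.BZip2Codec", conf);

        byte[] raw = "compression in hadoop ".repeat(1000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(codec, raw);
        byte[] restored = decompress(codec, packed);

        // getDefaultExtension() reports the filename extension associated with the codec.
        System.out.println("extension:        " + codec.getDefaultExtension());
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + Arrays.equals(raw, restored));
    }
}
```

Passing a different class name from the table to newCodec switches the format without touching the rest of the code, which is the point of coding against the CompressionCodec interface. Note that String.repeat in main requires Java 11 or later; on older JDKs a simple loop builds the same input.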








