Compression in Hadoop
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.
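Some of the compression formats used in Hadoop

| Compression Format | Tool | Algorithm | Filename Extension | Splittable |
|---|---|---|---|---|
| DEFLATE | N/A | DEFLATE | .deflate | No |
| gzip | gzip | DEFLATE | .gz | No |
| bzip2 | bzip2 | bzip2 | .bz2 | Yes |
| LZO | lzop | LZO | .lzo | No |
| Snappy | N/A | Snappy | .snappy | No |

Note: LZO files are not splittable as written, but they become splittable once they have been indexed in a preprocessing step.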
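The table lists DEFLATE as the algorithm behind both the DEFLATE and gzip formats. The following is a minimal sketch of that relationship using only the JDK's `java.util.zip` classes rather than Hadoop itself. The class name `DeflateVsGzip` and its helper methods are made up for illustration. It compresses the same repetitive input in both formats and prints the sizes, along with the two magic bytes that begin every gzip file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: plain JDK classes, not the Hadoop codec API.
class DeflateVsGzip {

    // Highly repetitive sample text, which compresses well.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("hadoop compression example line\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // DEFLATE data in the zlib wrapper, the default for DeflaterOutputStream.
    static byte[] deflate(byte[] input) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(buf)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // The same algorithm, wrapped in the gzip header and trailer.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buf)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        byte[] gz = gzip(data);
        System.out.println("original bytes: " + data.length);
        System.out.println("deflate bytes:  " + deflate(data).length);
        System.out.println("gzip bytes:     " + gz.length);
        // Every gzip stream starts with the magic bytes 1f 8b.
        System.out.printf("gzip magic:     %02x %02x%n", gz[0] & 0xff, gz[1] & 0xff);
    }
}
```

Real data is rarely this repetitive, so the ratios printed here say little about what to expect on production files.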
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.
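To make the idea of "a codec behind an interface" concrete, here is a rough sketch in plain Java. It does not use Hadoop: `SimpleCodec` and `GzipCodecSketch` are hypothetical names, loosely modelled on the stream-wrapping methods that CompressionCodec exposes, and the gzip work is delegated to the JDK's `java.util.zip` classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CodecSketch {

    // Hypothetical, simplified stand-in for a codec interface:
    // one method wraps an output stream, the other wraps an input stream.
    interface SimpleCodec {
        OutputStream createOutputStream(OutputStream raw) throws IOException;
        InputStream createInputStream(InputStream raw) throws IOException;
        String getDefaultExtension();
    }

    // One concrete codec, backed by the JDK's gzip streams.
    static class GzipCodecSketch implements SimpleCodec {
        public OutputStream createOutputStream(OutputStream raw) throws IOException {
            return new GZIPOutputStream(raw);
        }
        public InputStream createInputStream(InputStream raw) throws IOException {
            return new GZIPInputStream(raw);
        }
        public String getDefaultExtension() {
            return ".gz";
        }
    }

    // Compress a byte array through whichever codec is supplied.
    static byte[] compress(SimpleCodec codec, byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = codec.createOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decompress by reading the wrapped input stream to the end.
    static byte[] decompress(SimpleCodec codec, byte[] compressed) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = codec.createInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        SimpleCodec codec = new GzipCodecSketch();
        byte[] original = "codec round trip".getBytes(StandardCharsets.UTF_8);
        byte[] restored = decompress(codec, compress(codec, original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

Because callers depend only on the interface, swapping in a different format would mean supplying a different implementation, without touching `compress` or `decompress`.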
| Compression Format | Hadoop CompressionCodec |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| LZO | com.hadoop.compression.lzo.LzopCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
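Combined with the filename extensions listed earlier, this table amounts to a lookup from a file's extension to a codec class name. The sketch below expresses that lookup in plain Java. The class `CodecLookup` and the method `codecClassFor` are hypothetical helpers written for this post, not part of Hadoop; Hadoop ships its own `CompressionCodecFactory` for inferring a codec from a file's extension, which is the better choice in real jobs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative lookup built from the two tables in this post.
class CodecLookup {

    private static final Map<String, String> CODEC_BY_EXTENSION = new HashMap<>();

    static {
        CODEC_BY_EXTENSION.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODEC_BY_EXTENSION.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_EXTENSION.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_EXTENSION.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODEC_BY_EXTENSION.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name for a file name, or empty if the
    // name has no extension or the extension is not in the table.
    static Optional<String> codecClassFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot);
        return Optional.ofNullable(CODEC_BY_EXTENSION.get(extension));
    }

    public static void main(String[] args) {
        // Hypothetical file names, used only to exercise the lookup.
        for (String name : new String[] {"part-00000.gz", "events.bz2", "notes.txt"}) {
            System.out.println(name + " -> " + codecClassFor(name).orElse("no codec (uncompressed?)"));
        }
    }
}
```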