Compression in Hadoop

December 30, 2013

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network and to and from disk. With large volumes of data, both savings can be significant, so it pays to consider carefully how to use compression in Hadoop.

Some of the compression formats used in Hadoop

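Compression Format	Tool	Algorithm	Filename Extension	Splittable
DEFLATE	N/A	DEFLATE	.deflate	No
gzip	gzip	DEFLATE	.gz	No
bzip2	bzip2	bzip2	.bz2	Yes
LZO	lzop	LZO	.lzo	No (unless indexed)
Snappy	N/A	Snappy	.snappy	No

Note: LZO files can be made splittable by building an index for them in a preprocessing step.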
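
The table lists gzip and DEFLATE as separate formats that share one algorithm. The short Python sketch below illustrates that relationship outside Hadoop, using only the standard library: it compresses the same bytes as a raw DEFLATE stream, as gzip, and as bzip2, then prints the sizes. The sample data is made up for the illustration, and the numbers it prints say nothing about how real datasets will compress.

```python
import bz2
import gzip
import zlib

# Repetitive sample input, invented for this sketch.
data = b"2013-12-30 INFO block replicated to datanode\n" * 2000

# Raw DEFLATE stream: negative wbits tells zlib to write no header or trailer.
deflater = zlib.compressobj(level=9, wbits=-15)
raw_deflate = deflater.compress(data) + deflater.flush()

gz = gzip.compress(data)  # DEFLATE wrapped in the gzip container
bz = bz2.compress(data)   # bzip2, a different algorithm altogether

for name, blob in [("raw DEFLATE", raw_deflate), ("gzip", gz), ("bzip2", bz)]:
    print(f"{name:12s} {len(blob):7d} bytes (original: {len(data)} bytes)")

# With no optional header fields, a gzip file is a 10-byte header,
# a DEFLATE stream, and an 8-byte trailer (CRC-32 plus length).
assert zlib.decompress(gz[10:-8], wbits=-15) == data
```

The final assertion strips the gzip header and trailer and inflates what is left as plain DEFLATE, which is why the table shows the same algorithm for both rows.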

Codecs

A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.

Compression Format	Hadoop CompressionCodec
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec
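
Together, the two tables let you work out which codec class handles a file from its extension alone. Hadoop does this lookup in Java through CompressionCodecFactory; the Python sketch below only mirrors the idea with a plain dictionary built from the tables. The names CODEC_BY_EXTENSION and codec_for are hypothetical helpers for this illustration, not part of any Hadoop API.

```python
import os

# Extension-to-codec mapping taken from the two tables in this post.
CODEC_BY_EXTENSION = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_for(filename):
    """Return the codec class name for a filename, or None if no codec matches."""
    _, extension = os.path.splitext(filename)
    return CODEC_BY_EXTENSION.get(extension)


for name in ["part-00000.gz", "events.bz2", "notes.txt"]:
    print(name, "->", codec_for(name))
```

A result of None stands for a file that would be read as uncompressed, since its extension matches none of the listed formats.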
