Posts

Compression in Hadoop

File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network, or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

Some of the compression formats used in Hadoop:

Compression Format   Tool    Algorithm   Filename Extension   Splittable
DEFLATE              NA      DEFLATE     .deflate             No
gzip                 gzip    DEFLATE     .gz                  No
bzip2                bzip2   bzip2       .bz2                 Yes
LZO                  lzop    LZO         .lzo                 No
Snappy               NA      Snappy      .snappy              No

Codecs
A codec is the implementation of a compression-decompression algorithm, and in Hadoop it is represented by an implementation of the CompressionCodec interface.

Compression Format   Hadoop CompressionCodec
DEFLATE              org.apac...

The Hadoop Distributed Filesystem

Design of HDFS
HDFS is a filesystem designed for:
- Very large files: files that are hundreds of megabytes, gigabytes, or terabytes in size. Hadoop clusters running today store petabytes of data.
- Streaming data access: a write-once, read-many-times pattern.
- Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on.

There are also applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:
- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

Blocks
HDFS has the concept of a block, but it is a much larger unit: 64 MB by default. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks opera...

Failed to load Main-Class manifest attribute from HelloWorld.jar - SOLVED

When I tried to run a JAR file from the command prompt with the command below,
    java -jar HelloWorld.jar
I got this error:
    Failed to load Main-Class manifest attribute from HelloWorld.jar
This happens because the launch configuration is missing. The Main-Class header needs to be in the manifest of the JAR file; the manifest is metadata about things like other required libraries. See the Sun documentation for how to create an appropriate manifest.

Rather than remembering all the commands, I simply used Eclipse to export the JAR file and chose the following options:
1. Choose your class that contains the main method.
2. Choose the destination of the JAR file.
3. Once steps one and two are done, click Finish.

Now run the same command from the command prompt:
    java -jar HelloWorld.jar
Thi...

Deleting files with SIZE range

The -a is an explicit AND operator that allows you to join two primaries, in this case to create a range using -size.
    rm -rf `find . -size +300c -a -size -400c`
The above command deletes the files whose size is greater than 300 bytes and less than 400 bytes. The c suffix counts bytes, so for a range in kilobytes use the k suffix instead (for example +300k and -400k). Note that the size is a numeric argument that can optionally be prefixed with + or -. Numeric arguments can be specified as:
    +n  for greater than n
    -n  for less than n
     n  for exactly n
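As a cautious sketch of the size range test, the script below builds three scratch files of known sizes and lists only those larger than 300 bytes and smaller than 400 bytes. The file names and the temporary directory are made up for illustration, and listing the matches before deleting anything is just a safety habit, not part of the original command.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
# (All file names here are hypothetical.)
workdir=$(mktemp -d)

# Create files of exactly 100, 350 and 500 bytes.
dd if=/dev/zero of="$workdir/small.bin"  bs=100 count=1 2>/dev/null
dd if=/dev/zero of="$workdir/medium.bin" bs=350 count=1 2>/dev/null
dd if=/dev/zero of="$workdir/large.bin"  bs=500 count=1 2>/dev/null

# List regular files that are more than 300 bytes AND less than 400 bytes.
# Only medium.bin satisfies both -size tests.
matches=$(find "$workdir" -type f -size +300c -a -size -400c)
echo "$matches"

# Remove the scratch directory.
rm -rf "$workdir"
```

Once the listing looks right, one option is to let find do the deletion itself by ending the same expression with -exec rm -- {} +, which copes with file names containing spaces, something the backtick form does not.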

HADOOP - Installation setup

Prerequisites
Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using Java 1.6/1.7 (aka Java 6/7) is recommended for running Hadoop. Please refer to the JDK installation instructions here.

Dedicated user for Hadoop system
A dedicated Hadoop user helps separate the Hadoop installation from other software applications and user accounts running on the same machine.
    umasarath@ubuntu:~$ sudo addgroup hadoop
    umasarath@ubuntu:~$ sudo adduser --ingroup hadoop hduser
The above two commands add the "hadoop" group and the "hduser" user.

Hadoop Installation
Download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. The folder I chose was the hduser home folder. Extract the downloaded file in the /home/hduser folder, and make sure you extract it while logged in as hduser.
    hduser@ubuntu:~$ sudo tar xzf hadoop-1.2.1.tar.gz
    hduser@ubuntu:~$ sudo mv hadoop-1.2.1 hadoop

Configuratio...

Install JDK on Ubuntu

Installing OpenJDK from the Command Prompt
Issue the command apt-get install openjdk-7-jdk to install JDK 7. Ubuntu downloads the JDK automatically and starts the installation; the download takes a few minutes.
    umasarath@ubuntu:~$ sudo apt-get install openjdk-7-jdk

Verifying Java after installation
Ubuntu installs the JDK at /usr/lib/jvm/jdk-folder, for example /usr/lib/jvm/java-7-openjdk-amd64/. In addition, Ubuntu puts the JDK bin folder on the system path via a symbolic link, for example /usr/bin/java. To verify that the JDK is installed properly, type java -version at the command prompt.
    umasarath@ubuntu:~$ java -version
    java version "1.7.0_25"
    OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
    OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
    umasarath@ubuntu:/usr/lib/jvm/java-7-openjdk-amd64/bin$

Post-Installation Setup
To configure JAVA_HOME in system path each time the terminal is started, you can append the expor...

[SOLVED] - RSYNC not executing via CRON

I ran into the error below, searched online for answers for a couple of days until my head hurt, and finally cracked it.
    umasarath@ubuntu:~$ tail -50 cron_alc.log
    Cronjob started for back-up files
    + rsync -v umasarath@xx.xx.xx.xx:/tmp/compressed_logfiles/*.zip /home/umasarath/archive/
    Permission denied, please try again.
    Permission denied, please try again.
    Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
    rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
    rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
    umasarath@ubuntu:~$
Please follow the steps below to avoid the above error. My script contains the following code.
    umasarath@ubuntu:~$ cat transfer_files.sh
    #!/bin/sh
    set -xv
    echo "Cronjob started for back-up files" `date`
    /usr/bin/rsync -vv umasarath@xx.xx.xx.xx:/tmp/compressed_logfiles/*.zip /home/umasarath/archive/
    echo "Cronjob ended for back-u...