Installing Cloudera’s Hadoop Distribution CDH4 to a Virtual Machine

In this post I shall explain how to install Cloudera’s Hadoop distribution CDH4 to a virtual machine running Ubuntu 12.04 LTS from scratch. The second part can also be used to install Hadoop natively on your Ubuntu 12.04 installation. However, this tutorial was written with a development or experimental environment in mind.

Requirements:

  1. A computer with at least 4GB of RAM (8GB+ recommended), as running Hadoop in a virtual machine consumes a lot of memory.
  2. A 64 bit processor with hardware virtualization enabled in the BIOS. (If you don’t have a 64 bit processor, you won’t be able to use Ubuntu with CDH 4 and will have to choose another OS.)

 Virtual machine installation:

  1. Download the .iso-file for Ubuntu 12.04 LTS 64 bit from www.ubuntu.com.
  2. Next, download VirtualBox for your host operating system, e.g. Windows 7. The host operating system is the system running VirtualBox, whereas the virtualized operating system is called the guest system.
    You will find the binaries for your specific operating system with further installation instructions at https://www.virtualbox.org/wiki/Downloads.
  3. After installation, open VirtualBox and create a new virtual machine. Call it Ubuntu Hadoop or any other name and select Linux, Ubuntu (64 bit) as the operating system.
  4. In the next step, you will have to create a new virtual hard disk. I’d recommend a minimum of 20GB in VDI format with dynamic allocation. Dynamic allocation means that the disk file on your host won’t be created at its full size but will grow as the guest system’s disk fills up, limited to the maximum size you set, e.g. 20GB. Initially this will be slower than a fixed-size disk, but it consumes less space on the host system.
  5. Once you have created the virtual machine, start it by double-clicking its name. You can now choose the downloaded Ubuntu .iso-file as the start-up disk.
  6. This will open the Ubuntu installer. Follow the on-screen instructions to install Ubuntu to the virtual machine. Visit the Ubuntu homepage for further help with the installation.

Prerequisites for installing CDH4

  1. Install the OpenSSH server package.
    sudo apt-get install openssh-server
  2. Set a password for root. WARNING: Only do this for this virtual machine; in a production environment, an enabled root password can present a security risk.
    sudo passwd
  3. Edit /etc/hosts in an editor with root privileges and comment out the second line (the 127.0.1.1 entry) so that the hostname resolves to 127.0.0.1. The file should look something like this.
    127.0.0.1    antony-VirtualBox localhost
    #127.0.1.1    antony-VirtualBox
    
    # The following lines are desirable for IPv6 capable hosts
    #::1     ip6-localhost ip6-loopback
    #fe00::0 ip6-localnet
    #ff00::0 ip6-mcastprefix
    #ff02::1 ip6-allnodes 
    #ff02::2 ip6-allrouters

Installing Cloudera Hadoop Distribution (CDH4)

  1. Go to www.cloudera.com and locate Products – CDH. Click ‘Download and Install CDH 4’. On the next page, click ‘Download and Install CDH 4 automatically’. On the following page, under Cloudera Manager 4.1.1, click ‘Download’. Save the .bin-file to disk.
  2. Make the installer executable and run it as root.
    chmod u+x ./cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin
  3. Accept the licenses and follow the on-screen instructions. Be patient: the installer may appear to have crashed at times, but it simply takes a while to finish. At the end of the installation a browser should open with the Cloudera Manager login page.
  4. Log in with the credentials ‘admin’, ‘admin’.
  5. You can now add hosts to your cluster by clicking ‘Hosts’ and then ‘Add Hosts’. Enter localhost or 127.0.0.1 as the IP address. During this process you can choose whether to install YARN or MRv1; be sure to select the latter.
  6. After the installation and configuration of your cluster, you can access your running services under ‘Services’.
  7. You have successfully installed a one-node cluster on the virtual Ubuntu machine.

Using counters in Hadoop MapReduce

Sometimes when running MapReduce jobs, you want to know whether or how often a certain event has occurred during execution. Imagine an iterative algorithm that should run until no changes are made to the data during an iteration. Let’s assume that the change happens in the map function.

A common mistake is to use the context object to set a value in the configuration object:

context.getConfiguration().set("event", "hasOccured");

This approach only works when executing the job on a single machine. When running on a cluster of computers, each mapper or reducer has its own copy of the configuration object, so reading and writing values globally is not possible. The configuration can only be used to store information before the job is executed and to pass that information to the mappers and reducers, e.g. the file names of auxiliary files.
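
As a minimal sketch of how the configuration object is meant to be used (the property name "aux.file.path", the path and the job name are purely illustrative):

// In the driver, before the job is submitted: pass a value to all tasks.
Configuration conf = new Configuration();
conf.set("aux.file.path", "/data/lookup.txt");
Job job = new Job(conf, "example job");

// In a mapper or reducer: every task reads the same value from its own
// read-only copy of the configuration.
String auxFile = context.getConfiguration().get("aux.file.path");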

In order to know how often a mapper has changed a data item on a global level, we will use a so-called counter. First you have to define an enum which represents a group of counters. In this example we will only have one counter.

public enum MyCounters {
    Counter
}

From within the map method of our mapper, we can access the counter and increment it when we change a dataset. The counter is identified by the enum value.

context.getCounter(MyCounters.Counter).increment(1);
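
In context, a map method using this counter might look roughly like the following sketch; the key/value types and the dataChanged(...) helper are hypothetical:

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // dataChanged(...) is a hypothetical helper that modifies the record
        // and reports whether anything was actually changed.
        if (dataChanged(value)) {
            context.getCounter(MyCounters.Counter).increment(1);
        }
        context.write(new Text("someKey"), value);
    }
}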

Finally, we can read the counter after job execution and see whether the data has changed.

job.getCounters().findCounter(MyCounters.Counter).getValue();

All counters are displayed during job execution.
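
For the iterative algorithm described at the beginning of this section, the driver can simply re-run the job until the counter stays at zero. A rough sketch, where configureJob() stands in for your own job setup code:

// Re-run the job until no mapper reported a change via the counter.
long changes;
do {
    Job job = configureJob(); // hypothetical helper that creates and configures the job
    if (!job.waitForCompletion(true)) {
        throw new RuntimeException("Job failed");
    }
    changes = job.getCounters().findCounter(MyCounters.Counter).getValue();
} while (changes > 0);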

Summary: Counters are a useful feature provided by the Hadoop framework to globally count certain values during job execution. They can also be used, for example, to count how many damaged or malformed datasets were in the input data.