Using counters in Hadoop MapReduce

Sometimes when running MapReduce jobs, you want to know whether or how often a certain event has occurred during execution. Imagine an iterative algorithm that should run until no changes are made to the data during an iteration. Let’s assume that the change happens in the map function.

A common mistake is to use the context object and set a value in its configuration object:

context.getConfiguration().set("event", "hasOccurred");

This approach only works when executing the job on a single machine. When running on a cluster, each mapper and reducer has its own configuration object, so global reading and writing through it is not possible. The configuration can only be used to store information before the job is submitted and to pass this information to the mappers and reducers, e.g. filenames of auxiliary files.
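For illustration, here is a minimal sketch of that legitimate use; the property name aux.file and the path are made up for this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Before submitting the job: pass static information to all tasks.
Configuration conf = new Configuration();
conf.set("aux.file", "/data/lookup.txt"); // hypothetical property name
Job job = Job.getInstance(conf, "counter example");

// Inside a mapper or reducer the value can be read again, but any change
// made to the configuration there stays local to that task:
String auxFile = context.getConfiguration().get("aux.file");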

In order to know how often a mapper has changed a data item on a global level, we will use a so-called counter. First, you have to define an enum that represents a group of counters. In this example we will only have one counter.

public enum MyCounters {
    Counter
}

From within the map method of our mapper, we can access the counter and increment it whenever we change a data item. The counter is identified by the enum value.

context.getCounter(MyCounters.Counter).increment(1);
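Put together, a mapper using the counter could look like the following sketch; MyMapper, the key/value types, and the transform helper are assumptions for illustration, not part of the original example:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical change check standing in for the real algorithm.
        if (transform(value)) {
            context.getCounter(MyCounters.Counter).increment(1);
        }
        context.write(key, value);
    }

    // Placeholder: returns true if it modified the data item.
    private boolean transform(Text value) {
        return false;
    }
}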

Finally, we can read the counter after job execution and see whether the data has changed.

job.getCounters().findCounter(MyCounters.Counter).getValue();
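This fits naturally with the iterative scenario from the introduction: run the job, check the counter, and repeat until a pass makes no more changes. A rough driver sketch, assuming a hypothetical configureJob helper that sets up one iteration:

long changes;
do {
    // A Job instance cannot be reused, so each iteration creates a new one.
    Job job = configureJob(); // hypothetical helper, not shown here
    if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("Iteration failed");
    }
    changes = job.getCounters().findCounter(MyCounters.Counter).getValue();
} while (changes > 0);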

All counters are also displayed in the job’s web UI while it runs and printed by the client once the job finishes.

Summary: Counters are a useful feature provided by the Hadoop framework to track certain values globally during job execution. They can also be used to count how many damaged or malformed records were in the input data.