In the realm of big data, Hadoop stands out as a cornerstone technology, enabling the processing of vast datasets across clusters of computers using simple programming models. Among the most fundamental, yet powerful, examples of Hadoop’s capabilities is the MapReduce job for counting words in a text, commonly known as the Wordcount example. This guide is designed to take you through the process of running a Wordcount MapReduce job in Hadoop, providing insights into both the theoretical underpinnings and practical steps.

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then fed as input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail.
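
To make this concrete, suppose the input consists of the two lines "hello world" and "hello hadoop". The map tasks emit one (word, 1) pair per token, the framework groups the pairs by key, and the reduce tasks sum each group:

map output:    (hello, 1) (world, 1) (hello, 1) (hadoop, 1)
after sorting: hadoop -> [1]   hello -> [1, 1]   world -> [1]
reduce output: (hadoop, 1) (hello, 2) (world, 1)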

Running a Wordcount MapReduce Job in Hadoop

Prerequisites

Before diving into the MapReduce job, ensure that you have Hadoop installed and configured on your system or cluster. You should have basic knowledge of the Hadoop Distributed File System (HDFS) and the command line interface to interact with your Hadoop cluster.
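
As a quick sanity check, the hadoop version command prints the installed release, and listing the HDFS root confirms that your command line can reach the cluster:

hadoop version
hadoop fs -ls /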

Step 1: Preparing the Input

First, prepare a text file that will be used as the input for the Wordcount job. You can choose any text, such as a book or article. Place this file into HDFS using the following command:

hadoop fs -put <local-input-path> /user/hadoop/input
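
If the target directory does not exist yet, create it first and then verify the upload; the /user/hadoop/input path here simply mirrors the example above, so adjust it to your own HDFS directory layout:

hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -ls /user/hadoop/input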

Step 2: Writing the MapReduce Program

The MapReduce program for Wordcount involves writing a mapper function that processes input key/value pairs to generate a set of intermediate key/value pairs, and a reducer function that merges all intermediate values associated with the same intermediate key. Here is a simplified version of what the code might look like in Java:


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits a (word, 1) pair for every token in each input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all counts emitted for the same word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
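
As written, the class defines only the mapper and reducer. To actually submit a job, the standard Hadoop WordCount example also adds a main method inside the same class that configures the job and wires the two together; a sketch of that driver is shown below (the job name "word count" is arbitrary, and it relies on the Configuration, Job, Path, FileInputFormat, and FileOutputFormat imports already listed above):

  public static void main(String[] args) throws Exception {
    // args[0] = HDFS input path, args[1] = HDFS output path
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }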

Step 3: Compiling and Packaging the Program

Compile the Java program and package it into a JAR file. This step varies depending on your development environment, but generally, you can use Maven or Gradle to automate the compilation and packaging processes.
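
If you prefer not to use a build tool, a minimal manual approach works as well; the sketch below assumes the hadoop command is on your PATH and uses wordcount.jar purely as an example name:

javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wordcount.jar WordCount*.class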

Step 4: Running the Job

With the input data in place and your MapReduce program packaged into a JAR file, you're ready to run the job. Execute the following command, replacing <your-jar-file> with the path to your JAR file:

hadoop jar <your-jar-file> WordCount /user/hadoop/input /user/hadoop/output
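
Note that the job will fail if the output directory already exists, so remove any previous output before re-running:

hadoop fs -rm -r /user/hadoop/output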

Step 5: Checking the Results

After the job completes, you can check the output in HDFS:

hadoop fs -cat /user/hadoop/output/part-r-00000

This command displays the words and their respective counts found in your input file.
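
If the job ran with multiple reducers, the output will be split across several part-r-* files; you can list them or copy the whole directory to your local filesystem (the local directory name below is just an example):

hadoop fs -ls /user/hadoop/output
hadoop fs -get /user/hadoop/output ./wordcount-output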

Conclusion

Running a Wordcount MapReduce job in Hadoop is a quintessential example of leveraging Hadoop's distributed data processing capabilities. This guide has walked you through the steps from preparing your input data to executing the MapReduce job and inspecting its output. By mastering this basic example, you'll gain a solid foundation for tackling more complex data processing tasks with Hadoop and MapReduce. Whether you're a student, a data analyst, or a developer, the Wordcount example is a practical first step into distributed data processing.
