Apache Hadoop MapReduce


Before diving into this tutorial, I suggest you read up on Big Data, Hadoop, HDFS and YARN if you are not already familiar with those topics.


What is MapReduce

MapReduce is an Apache framework used to process large amounts of data in parallel across a Hadoop cluster. It works in a divide-and-conquer manner: the input is divided into pieces, the pieces are processed independently, and the partial results are then combined.

There are two key components in Hadoop MapReduce.

Components

  1. Mapper
  2. Reducer

Mapper

The mapper takes an input split as its input and processes the records in that split. The result is a collection of key-value pairs, which is written to the local disk of the node that ran the mapper (not to HDFS). The number of mappers is determined by the number of input splits: one mapper runs for each input split.
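To make the mapper's role concrete, here is a minimal sketch in plain Python. It is not the Hadoop API, only a simulation: the function name map_split, the sample splits and the word-count task are all assumptions chosen for illustration.

```python
def map_split(split):
    """Emit a (word, 1) pair for every word found in one input split."""
    pairs = []
    for line in split:
        for word in line.split():
            pairs.append((word.lower(), 1))
    return pairs


# Two hypothetical input splits, so the mapper runs twice.
splits = [
    ["Hadoop stores data", "Hadoop processes data"],
    ["MapReduce processes data"],
]

# One mapper call per split; each call returns its own intermediate pairs.
map_outputs = [map_split(split) for split in splits]

for i, pairs in enumerate(map_outputs):
    print(f"mapper {i}: {pairs}")
```

Note that each mapper only sees its own split, so the same key (for example "data") can appear in the output of several mappers.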

Reducer

The reducer takes the mapper output as its input. It processes that intermediate result (a collection of key-value pairs), combines the values that share a key, and produces a smaller collection of key-value pairs. The final output is produced by the reducer. The number of reducers is configurable, and each reducer writes one output file, so the number of output files equals the number of reducers.
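The following sketch simulates the reduce step in plain Python. It assumes the values have already been grouped by key (see the Shuffling phase in the workflow) and that the task is a word count. The names partition, reduce_key and run_reducers are hypothetical, and zlib.crc32 is only a stand-in for a hash partitioner, used here because it gives the same result on every run.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_key(key, values):
    """Combine all values of one key into a single (key, total) pair."""
    return key, sum(values)


def run_reducers(grouped, num_reducers):
    """Return one output list per reducer, like one output file per reducer."""
    outputs = [[] for _ in range(num_reducers)]
    for key in sorted(grouped):
        reducer_id = partition(key, num_reducers)
        outputs[reducer_id].append(reduce_key(key, grouped[key]))
    return outputs


# Hypothetical grouped map output: every key with all of its values.
grouped = {"data": [1, 1, 1], "hadoop": [1, 1], "mapreduce": [1]}

outputs = run_reducers(grouped, num_reducers=2)
for i, part in enumerate(outputs):
    # Hadoop names reducer output files in the style part-r-00000.
    print(f"part-r-{i:05d}: {part}")
```

With num_reducers=2 the sketch always produces two output lists, whatever the number of keys, which mirrors the rule that the reducer count decides the output file count.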

Apache Hadoop MapReduce Workflow

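Input split

An input split is a logical piece of the input file that is handed to one mapper. By default the split size follows the HDFS block size, and the records inside a split are read line by line. The mapper can then break each record into fields using whatever delimiter the data uses: a space, comma, semicolon, tab and so on.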
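As a rough illustration, the sketch below cuts a small input into splits and then parses the records of one split. It is a simplification: Hadoop sizes splits in bytes, while this sketch counts lines to keep things short. The function make_splits, the sample records and the comma delimiter are assumptions made for the example.

```python
def make_splits(lines, lines_per_split):
    """Cut the input into consecutive chunks of at most lines_per_split lines."""
    return [
        lines[start:start + lines_per_split]
        for start in range(0, len(lines), lines_per_split)
    ]


# Hypothetical comma-separated input file, one record per line.
records = [
    "1,apple,3",
    "2,banana,5",
    "3,cherry,7",
    "4,date,2",
    "5,elder,9",
]

splits = make_splits(records, lines_per_split=2)
print(f"{len(splits)} splits, so {len(splits)} mappers would run")

# Inside a mapper, each record of the split is broken into fields.
fields = [record.split(",") for record in splits[0]]
print(fields)
```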

Mapping

Mapping is covered in the Mapper section above: each input split is processed by one mapper, which emits key-value pairs.

Shuffling

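This is the phase that brings together all the values associated with the same key. The map output is sorted by key and transferred to the reducers, so that every value for a given key ends up at the same reducer. One group is formed for each distinct key emitted in the mapping phase.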
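The grouping can be sketched in a few lines of plain Python. This only simulates the idea in memory; in a real cluster the data moves across the network between nodes. The function name shuffle and the sample map outputs are assumptions for illustration.

```python
from collections import defaultdict


def shuffle(map_outputs):
    """Group the values of identical keys across all mapper outputs."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    # Keys reach a reducer in sorted order, so sort them here as well.
    return dict(sorted(grouped.items()))


# Hypothetical output of two mappers for a word count.
map_outputs = [
    [("hadoop", 1), ("data", 1), ("hadoop", 1)],
    [("data", 1), ("mapreduce", 1)],
]

grouped = shuffle(map_outputs)
print(grouped)
# prints {'data': [1, 1], 'hadoop': [1, 1], 'mapreduce': [1]}
```

Three distinct keys were emitted by the mappers, so three groups come out of the shuffle.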

Reducing

Reducing is covered in the Reducer section above: the grouped values for each key are combined into the final key-value pairs.

Final output

The final output is a collection of key-value pairs. It is persisted in HDFS.
 

What next?

In the next post we will look at Apache Hive.

Post your queries in the comment section :)
