Apache Hadoop MapReduce

October 01, 2019

Before diving into this tutorial, I suggest reading about Big Data, Hadoop, HDFS and YARN if you are not already familiar with those topics.

What is MapReduce

MapReduce is the Apache Hadoop framework for processing large amounts of data in parallel across a Hadoop cluster. It works in a divide-and-conquer manner: the input is divided into pieces that are processed independently, and the partial results are then combined.

There are two key components in Hadoop MapReduce.

Components

Mapper
Reducer

Mapper

A mapper takes one input split as its input and processes it. The result of processing an input split is a collection of key-value pairs, which is persisted on the local disk. The number of mappers is decided by the number of input splits: one mapper runs for each input split.
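The mapper behaviour described above can be sketched in plain Python. This is only an illustration and not the Hadoop API: the function name map_words, the word-count logic and the sample splits are all assumptions made for the example, and the pairs are kept in memory rather than written to local disk.

```python
from typing import Iterator, List, Tuple


def map_words(split: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical mapper: emit a (word, 1) pair for every word in one input split."""
    for word in split.split():
        yield (word.lower(), 1)


# Assumed sample input: three input splits, so three mapper runs.
input_splits: List[str] = [
    "big data hadoop",
    "hadoop hdfs yarn",
    "hadoop mapreduce",
]

# One mapper per input split; each run produces its own collection of key-value pairs.
mapper_outputs = [list(map_words(split)) for split in input_splits]

for index, pairs in enumerate(mapper_outputs):
    print(f"mapper {index}: {pairs}")
```

The length of mapper_outputs always equals the length of input_splits, which mirrors the rule that the number of mappers follows the number of input splits.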

Reducer

A reducer takes the mapper output as its input, processes that intermediate result (a collection of key-value pairs), and combines those pairs into a smaller collection of key-value pairs. In a job with a reduce phase, the final output is produced only by the reducers. The number of reducers is configurable, and it decides the number of output files: each reducer generates one output file.
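A minimal sketch of the reduce side, again in plain Python rather than the Hadoop API. The names NUM_REDUCERS, partition and reduce_counts, the sample mapper output and the CRC32-based partitioning are assumptions chosen for the illustration; the dictionary keys only mimic Hadoop's part-r-NNNNN output file naming.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_REDUCERS = 2  # assumed setting; in Hadoop this is configured on the job

# Assumed intermediate result collected from the mappers.
mapper_output: List[Tuple[str, int]] = [
    ("hadoop", 1), ("hdfs", 1), ("hadoop", 1), ("yarn", 1), ("hadoop", 1),
]


def partition(key: str, num_reducers: int) -> int:
    """Hypothetical partitioner: a deterministic hash of the key picks the reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Hypothetical reducer: combine all values of one key into a single pair."""
    return (key, sum(values))


# Group the values by key, separately for each reducer.
reducer_inputs: List[Dict[str, List[int]]] = [
    defaultdict(list) for _ in range(NUM_REDUCERS)
]
for key, value in mapper_output:
    reducer_inputs[partition(key, NUM_REDUCERS)][key].append(value)

# Each reducer writes exactly one output "file", even if it received no keys.
output_files: Dict[str, List[Tuple[str, int]]] = {}
for index, grouped in enumerate(reducer_inputs):
    name = f"part-r-{index:05d}"
    output_files[name] = [reduce_counts(k, v) for k, v in sorted(grouped.items())]

for name, pairs in output_files.items():
    print(name, pairs)
```

Five intermediate pairs are combined into three output pairs, and the number of entries in output_files equals NUM_REDUCERS regardless of how the keys are distributed.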