Apache Hadoop MapReduce
Before diving into this tutorial, if you are not yet familiar with Big Data, Hadoop, HDFS, and YARN, I suggest reading about those topics first.
What is MapReduce
MapReduce is an Apache framework used to process large amounts of data in parallel on a Hadoop cluster. It does the job in a divide-and-conquer manner. There are two key components in Hadoop MapReduce.
Components
- Mapper
- Reducer
Mapper
It takes an input split as input and processes each record in that split. The result is a collection of key/value pairs, which is persisted on the local disk. The number of mappers is determined by the number of input splits: one mapper runs for each input split.
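To make this concrete, here is a minimal word-count style mapper sketch. The class name WordCountMapper is illustrative; it assumes a plain-text input where each call receives one line from the input split and emits a (word, 1) pair for every token.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: input key is the byte offset of the line, input value is the line text;
// output is (word, 1) pairs that Hadoop will later shuffle by key.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit each token with a count of 1.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
```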
Reducer
It takes the mapper output as its input, processes that intermediate result (a collection of key/value pairs), combines the values that share a key, and produces a smaller set of key/value pairs. The final output is produced only by the reducer. The number of reducers is configurable, and the number of output files matches it: one output file is generated per reducer.
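A matching reducer sketch for the illustrative word-count example above: for each key it receives all values grouped by the shuffle, sums them, and writes a single (word, total) pair.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: receives (word, [1, 1, ...]) from the shuffle phase
// and emits (word, total count) as the final output.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted by the mappers for this key.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```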
Apache Hadoop MapReduce Workflow
Inputsplit
A piece of data from the input file. Records within a split are separated by a delimiter such as a space, comma, semicolon, new line, or tab.
Mapping
Covered in the Mapper section of the components above.
Shuffling
This is the phase that groups together all the values associated with the same key. One group is produced for each distinct key emitted in the mapping phase. For example, if the mappers emit (cat, 1), (dog, 1), and (cat, 1), shuffling produces (cat, [1, 1]) and (dog, [1]).
Reducing
Covered in the Reducer section of the components above.
Final output
The final output is a collection of key/value pairs and is persisted in HDFS storage.
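To tie the workflow together, here is a minimal driver sketch that wires the illustrative WordCountMapper and WordCountReducer above into a job. The input and output paths come from the command line, and the reducer count of 2 is just an example value.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver that configures and submits the word-count job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // The number of reducers is configurable; each reducer writes one output file.
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS; the final key/value pairs are persisted back to HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```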