What is Hadoop?

Before diving into Hadoop, we have to understand the issues with Big Data and traditional processing systems. In the previous blog, we already discussed Big Data in detail. Hadoop is an open-source batch processing framework developed in Java, used to store and analyze large data sets. It has been used by companies such as Yahoo, Facebook, Twitter, and LinkedIn.

Components of Hadoop

  •  HDFS - Hadoop Distributed File System, used to store huge data sets across the cluster.
  •  YARN - Yet Another Resource Negotiator, used for managing the cluster's resources.
  •  MapReduce - A software framework for processing huge data sets in parallel on a cluster (see the sketch below).
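
To make the MapReduce component concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The mapper emits (word, 1) pairs and the reducer sums them; the input and output paths come from the command line, and the class names are only illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sums the counts collected for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, a job like this is typically launched with something like: hadoop jar wordcount.jar WordCount /input /output (the paths here are hypothetical).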

Features of Hadoop 

  •   Cluster storage - Hadoop splits a single data set into blocks and stores them across the cluster (more than one storage system working together) with replication (3 copies by default). This increases both performance and reliability (the sketch after this list shows how to inspect replication from Java).
  •   Distributed computing - A single problem is divided into multiple sub-problems, and each sub-problem is solved by a different computer. The computers communicate with each other to avoid duplicating work, and once all sub-problems are done the results are combined into a single solution.
  •   Commodity hardware - Commodity hardware simply means cheap, widely available hardware. We can use ordinary systems in the cluster; there is no need to buy expensive, specialized machines.
  •   Parallel processing - A single task is split into many smaller tasks, and each one runs on a separate CPU, reducing the program's overall running time.
  •   High throughput - Hadoop processes huge data sets in less time by spreading the work across the cluster, though as a batch system it is optimized for throughput rather than low latency.
  •   Data availability - Data is stored across the cluster with replication, so if any machine in the cluster goes down, its data is still available on another machine.
  •   Fault tolerance - The system keeps working as usual, without any data loss, even if some machines fail. This is one of the main advantages of Hadoop.
  •   Horizontal scalability - We can add or remove servers in the cluster without interrupting the existing servers.
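
To see replication in practice, below is a minimal sketch using Hadoop's org.apache.hadoop.fs.FileSystem API to print a file's replication factor and the DataNodes holding each of its blocks. It assumes a core-site.xml on the classpath points at the cluster; the file path passed as an argument is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the cluster's default file system
        Path file = new Path(args[0]);              // e.g. /data/input.txt (hypothetical)

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each BlockLocation lists the DataNodes holding one replicated block
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + block.getOffset()
              + " stored on: " + String.join(", ", block.getHosts()));
        }
      }
    }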

Hadoop Daemons


  • NameNode - The master node of HDFS. It stores the metadata of all files, such as the number of blocks, their locations, and their replicas, and it manages the slave (DataNode) nodes.
  • Secondary NameNode - Also called the Checkpoint Node. Despite the name, it is not a hot standby for the NameNode; it regularly gets the FsImage and EditLogs from the NameNode and merges the EditLogs into the FsImage to produce a fresh checkpoint.
  • DataNode - The slave node that stores the actual data in HDFS and performs read and write operations as requested.
  • ResourceManager - Runs on the master node, one per cluster, and allocates cluster resources among running applications.
  • NodeManager - Runs on each slave node and launches and monitors the containers in which application tasks run (the sketch after this list shows the configuration keys that point clients at these daemons).
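
As a small illustration of how clients locate these daemons, the sketch below reads the standard configuration keys with Hadoop's Configuration class. What is actually returned depends on the *-site.xml files on the classpath; a bare Configuration only loads the core defaults, so the fallback values here are assumptions.

    import org.apache.hadoop.conf.Configuration;

    public class DaemonConfigDemo {
      public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();

        // fs.defaultFS points HDFS clients at the NameNode, e.g. hdfs://namenode-host:9000
        System.out.println("NameNode URI:       " + conf.get("fs.defaultFS", "file:///"));

        // yarn.resourcemanager.hostname points YARN clients at the ResourceManager
        // (normally set in yarn-site.xml; "0.0.0.0" is the shipped default)
        System.out.println("ResourceManager:    " + conf.get("yarn.resourcemanager.hostname", "0.0.0.0"));

        // dfs.replication controls how many DataNodes keep a copy of each block
        System.out.println("Replication factor: " + conf.get("dfs.replication", "3"));
      }
    }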


