What is HDFS

HDFS (Hadoop Distributed File System) is a file system used to store data, much like the file system on a desktop or laptop. It is specially designed to store huge datasets on a cluster of commodity hardware, with a streaming access pattern. The data can be text, images, audio, video, and so on.

Streaming access pattern

A streaming access pattern means write once, read many times: a file is written once and then read any number of times, but its content is not changed.
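As a rough illustration of the write-once, read-many idea, here is a minimal Python sketch. The `WriteOnceStore` class and the file path are hypothetical names made up for this example. It is a toy in-memory model, not the HDFS API.

```python
class WriteOnceStore:
    """Toy in-memory store illustrating write-once, read-many semantics."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        # A path may be written exactly once; a second write is rejected.
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be modified")
        self._files[path] = bytes(data)

    def read(self, path):
        # Reads are unrestricted and can be repeated any number of times.
        return self._files[path]


store = WriteOnceStore()
store.write("/logs/day1.txt", b"first version")

first_read = store.read("/logs/day1.txt")
second_read = store.read("/logs/day1.txt")

try:
    store.write("/logs/day1.txt", b"second version")
    overwrite_rejected = False
except FileExistsError:
    overwrite_rejected = True
```

The file can be read as often as needed, but the second write to the same path fails.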

Operations in HDFS

  1. Write Operation
  2. Read Operation 

Write Operation

Assume that you are writing a file into HDFS. Your write request goes through the Distributed File System (DFS) client, which makes an RPC call to the NameNode (NN) to create the new file.

Before creating the file, the namenode performs a couple of checks: that the file does not already exist, and that the user has permission to create a new file. Once all the checks pass, the namenode provides the available DataNodes (DN) to write to.
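The checks above can be sketched in a few lines of Python. `ToyNameNode`, its fields, and the user and path names are all hypothetical and exist only for illustration. A real namenode uses a full permission model rather than a simple set of allowed users.

```python
class ToyNameNode:
    """Toy model of the namenode's checks before a file is created."""

    def __init__(self, datanodes, writers):
        self.datanodes = list(datanodes)  # datanodes assumed to be available
        self.writers = set(writers)       # users allowed to create files (assumption)
        self.namespace = {}               # path -> list of block ids

    def create(self, path, user):
        # Check 1: the file must not already exist.
        if path in self.namespace:
            raise FileExistsError(f"{path} already exists")
        # Check 2: the user must have permission to create a new file.
        if user not in self.writers:
            raise PermissionError(f"{user} may not create {path}")
        # Both checks passed: record the file and hand back the datanodes.
        self.namespace[path] = []
        return list(self.datanodes)


namenode = ToyNameNode(datanodes=["dn1", "dn2", "dn3"], writers={"alice"})
targets = namenode.create("/data/report.txt", user="alice")
```

Calling `create` again for the same path raises `FileExistsError`, and calling it as a user outside `writers` raises `PermissionError`.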

Now the file is written to the datanodes. The client splits the file into blocks, and each block is replicated across datanodes based on the configured replication factor. The datanodes send a success acknowledgement once the blocks have been replicated.
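To make the splitting and replication concrete, here is a small Python sketch. The function names, the tiny block size, and the round-robin placement are assumptions chosen for readability. Real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct datanodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of datanodes")
    placement = {}
    for block_index in range(num_blocks):
        placement[block_index] = [
            datanodes[(block_index + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


data = b"x" * 10  # a 10-byte "file"
blocks = split_into_blocks(data, block_size=4)  # block sizes: 4, 4, 2
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

for block_index, nodes in placement.items():
    print(block_index, len(blocks[block_index]), nodes)
```

With these toy values the 10-byte file becomes three blocks, and each block is stored on three different datanodes, giving nine block replicas in total.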

Read Operation

Assume that you are trying to read a file from HDFS. The request goes through the DFS client to the namenode. The namenode checks whether the file exists and whether the user has permission to read it. If all the checks pass, it returns the file's metadata: which datanodes hold the blocks of the requested file, along with the block and replica details. Based on that, the client reads the blocks of data from the datanodes and assembles the file.
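The read path can be sketched the same way. The dictionaries below stand in for the namenode's metadata and the datanodes' storage. All names, block ids, and contents are made up for this example, and the permission check is left out to keep it short.

```python
# Namenode-style metadata: path -> ordered list of (block id, replica locations).
block_locations = {
    "/data/report.txt": [
        ("blk_1", ["dn1", "dn2"]),
        ("blk_2", ["dn2", "dn3"]),
    ],
}

# Datanode-style storage: datanode -> {block id: bytes}.
datanode_storage = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}


def read_file(path, live_datanodes):
    """Look up the block locations, then read each block from a live replica."""
    if path not in block_locations:
        raise FileNotFoundError(path)
    chunks = []
    for block_id, replicas in block_locations[path]:
        for datanode in replicas:
            if datanode in live_datanodes:
                chunks.append(datanode_storage[datanode][block_id])
                break
        else:
            # No replica of this block is reachable.
            raise IOError(f"no live replica for {block_id}")
    return b"".join(chunks)


content = read_file("/data/report.txt", live_datanodes={"dn1", "dn2", "dn3"})
content_without_dn1 = read_file("/data/report.txt", live_datanodes={"dn2", "dn3"})
```

Because every block has more than one replica, the read still succeeds in this sketch when `dn1` is unavailable.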



What next?

Interact with HDFS using the Command Line Interface (CLI).


Post your queries in the comments section :)

 
