Big Data

Posts

Apache Hadoop MapReduce

October 01, 2019

Before drive into this tutorial, I suggest you to read what is Bigdata , Hadoop , HDFS and YARN . If you not aware of those topics. What is MapReduce MapReduce is an Apache framework used to process large amount of data in parallel which is presented on hadoop cluster. It's will do the job as divide and conquer manner. There are two key components in Hadoop MapReduce. Component Mapper and Reducer Mapper It takes input from input split and process the each input split. The result of processed input split will be collection of key, value pairs. The result will be persist on the local disk. The number of mapper will be desired based on input split. How many number of inputsplit is there, those many number of mapper will be run. Reducer It takes Mapper output as a input and process that intermediate result (collection of key, value pairs) and combine those key, value pairs and create a smaller set of collection (Key, value pairs). The final output wi...

YARN - Yet Another Resource Negotiator

September 26, 2019

I hope you know what is BigData , Hadoop and HDFS If not, I suggest you to read above topics before read this. What is YARN? YARN stands for Yet Another Resource Negotiator. It's one of the Hadoop core components. YARN is use to manage the hadoop cluster. like schedule task and manage the resource. In Hadoop V1, MapReduce is the one who handled all resource related details and task/job details. It's over load for MapRedice job. So, in Hadoop V2 they splitted resource related things separately and name as YARN. Components: Resource Manager Node Manager Resource Manager It's master node in YARN. Only per cluster. It knows the slave node details. It inhabit the JobTracker of MapReduce Version 1 (MRV1). Resource Scheduler Resource Scheduler is responsible for allocating resource to application and it's not perform any monitoring and tracking activities like application failure, Hardware failure and so on. App Manager It maintain the...

HDFS Commands Part - II

September 19, 2019

In part - I session we learned about HDFS basic commands, in this session will see the intermediate level commands. Before read this article I suggest you to learn basic hdfs commands. Commands 1. copyFromLocal This HDFS command is similar to put command, but the source is restricted to a local file reference. Usages: hdfs dfs -copyFromLocal <local_path> <hdfs_path> Example: hdfs dfs -copyFromLocal /home/user/Desktop/file.orc /dir_1/ 2. copyToLocal This HDFS command will copy file/directory from HDFS to local file system. Usages: hdfs dfs -copyToLocal <hdfs_path <local_path> Example: hdfs dfs -copyToLocal /dir_1/file.orc /home/user/Desktop/ 3. text This HDFS command will take the source file and display the file content in text formad. Usages: hdfs dfs -text <hdfs_file_path> ...

What is HDFS

September 14, 2019

What is HDFS HDFS (Hadoop Distributed File System) is a file system like our normal desktop/laptop file system which is used to store the data. It's specially designed for storing huge datasets with cluster of commodity hardware and with streaming access pattern. The data may be text file, image, audio, video, etc... Streaming access pattern Streaming access pattern means write once read many number of time but don't change content of the file is called as streaming access pattern. Operations in HDFS Write Operation Read Operation Write Operation Assume that you are writing file into HDFS. Your write request will go NameNode (NN) Distributed File System (DFS). The DFS will make RPC call to the namenode for create new file. Before creating file the namenode will do couple of things. It will check whether the file is not exist and user has permission to create new file. Once all the check is done successfully the namenode will provide a...

HDFS Commands Part - I

November 30, 2018

Prequirement Before start Hadoop shell have to install Hadoop . File System Shell Most of the commands in FS shell is like corresponding Linux commands. The FileSystem (FS) shell is invoked by bin/hadoop fs <args> . All the FS shell commands take path URIs as arguments. For HDFS the scheme is hdfs , and for the local filesystem the scheme is file . The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://<namenodehost>/dir_1/dir_2 or simply as /dir_1/dir_2 ( given that your configuration is set to point to hdfs://<namenodehost> ). Error information is sent to stderr and the output is sent to stdout . Basic Commands 1. version This HDFS command prints the Hadoop version. Example: hdfs dfs version 2. cat This HDFS command used to displays the conten...

Install Hadoop On Ubuntu

August 05, 2018

Prequirement Before installing Hadoop, you have to install Java . Hadoop Installation Steps Step 1: Create Separate Login $ sudo addgroup hadoop $ sudo adduser –ingroup hadoop hdfs user $ sudo adduser hdfsuser sudo Step 2: Install SSH $ sudo apt-get update $ sudo apt-get install ssh $ sudo su hdfsuser $ sudo ssh-keygen -t rsa -p "" >> If it's asking for file name or location, leave it blank. $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys ...

Search This Blog