Posts

What is Apache Hive

What is Apache Hive? Hadoop is like a sea with a lot of tools and technologies that get our job done, and Hive is one of those technologies. Hive runs on top of Hadoop. Apache Hive is a Hadoop component that was developed primarily for data analysts. Even though Apache Pig can be used for the same purpose, Hive is used more by researchers and programmers. It is an open-source data warehousing system, used exclusively to query and analyse huge volumes of data stored in Hadoop HDFS. Hive supports data querying, data summarization and data analysis. HiveQL is the query language of Hive; it translates SQL-like queries into MapReduce jobs and deploys them on Hadoop. Hive provides a shell where we can perform the basic operations that Hive supports. If we run HiveQL in the Hive shell, it will invoke a MapReduce job internally and bring back the result. Hive has schema flexibility and supports data serialisation and deserialisation. Advantage of...
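As a minimal sketch of the HiveQL-to-MapReduce flow described above, assuming Hive is installed: the employees table and its dept column are hypothetical examples, not from the post.

     $ hive -e "SELECT dept, COUNT(*) FROM employees GROUP BY dept;"

Hive compiles this SQL-like query into a MapReduce job behind the scenes and prints the aggregated rows to stdout once the job finishes.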

Apache Hadoop MapReduce

Before diving into this tutorial, I suggest you read what is Bigdata , Hadoop , HDFS and YARN , if you are not aware of those topics. What is MapReduce MapReduce is an Apache framework used to process large amounts of data in parallel on a Hadoop cluster. It does the job in a divide-and-conquer manner. There are two key components in Hadoop MapReduce: the Mapper and the Reducer. Mapper It takes input from an input split and processes each record of that split. The result of a processed input split is a collection of key-value pairs, which is persisted on the local disk. The number of mappers is decided by the number of input splits: however many input splits there are, that many mappers will run. Reducer It takes the Mapper output as its input, processes that intermediate result (a collection of key-value pairs), combines those key-value pairs and creates a smaller collection of key-value pairs. The final output wi...
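To make the Mapper/Reducer flow concrete, a minimal sketch using the word-count example jar that ships with Hadoop; the input file words.txt and the HDFS paths /input and /output are assumptions, and the exact jar location varies by installation.

     $ hdfs dfs -put words.txt /input
     $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
     $ hdfs dfs -cat /output/part-r-00000

The mappers emit a (word, 1) pair for every word in their input split, and the reducer sums those pairs per word into the final part-r-00000 output file.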

YARN - Yet Another Resource Negotiator

I hope you know what is BigData , Hadoop and HDFS . If not, I suggest you read those topics before reading this. What is YARN? YARN stands for Yet Another Resource Negotiator. It's one of the Hadoop core components. YARN is used to manage the Hadoop cluster, i.e. to schedule tasks and manage resources. In Hadoop V1, MapReduce was the one that handled all resource-related details as well as task/job details, which overloaded the MapReduce job. So, in Hadoop V2 they split the resource-related work out separately and named it YARN. Components: Resource Manager Node Manager Resource Manager It's the master node in YARN, and there is only one per cluster. It knows the slave node details. It takes over the role of the JobTracker of MapReduce Version 1 (MRV1). Resource Scheduler The Resource Scheduler is responsible for allocating resources to applications; it does not perform any monitoring or tracking activities such as handling application failure, hardware failure and so on. App Manager It maintains the...
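A small sketch of inspecting a running cluster with the yarn command-line client, assuming YARN is already up; the application id is a placeholder.

     $ yarn node -list
     $ yarn application -list
     $ yarn application -status <application_id>

It is the Resource Manager that answers these queries, which matches its role above as the single per-cluster master that knows the slave node details.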

HDFS Commands Part - II

In the Part - I session we learned the basic HDFS commands; in this session we will see the intermediate-level commands. Before reading this article I suggest you learn the basic HDFS commands. Commands 1. copyFromLocal This HDFS command is similar to the put command, but the source is restricted to a local file reference.
     Usage: hdfs dfs -copyFromLocal <local_path> <hdfs_path>
     Example: hdfs dfs -copyFromLocal /home/user/Desktop/file.orc /dir_1/
2. copyToLocal This HDFS command copies a file/directory from HDFS to the local file system.
     Usage: hdfs dfs -copyToLocal <hdfs_path> <local_path>
     Example: hdfs dfs -copyToLocal /dir_1/file.orc /home/user/Desktop/
3. text This HDFS command takes the source file and displays the file content in text format.
     Usage: hdfs dfs -text <hdfs_file_path> ...
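The first two commands can be verified with a quick round trip; a minimal sketch, assuming a hypothetical local file notes.txt and the /dir_1/ directory from the examples above.

     $ hdfs dfs -copyFromLocal /home/user/Desktop/notes.txt /dir_1/
     $ hdfs dfs -text /dir_1/notes.txt
     $ hdfs dfs -copyToLocal /dir_1/notes.txt /tmp/

The -text step shows the uploaded content, and -copyToLocal brings the same file back to the local file system.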

What is HDFS

What is HDFS HDFS (Hadoop Distributed File System) is a file system, like our normal desktop/laptop file system, which is used to store data. It's specially designed for storing huge datasets on a cluster of commodity hardware and with a streaming access pattern. The data may be text files, images, audio, video, etc... Streaming access pattern Streaming access pattern means writing a file once and reading it many times, but never changing the content of the file. Operations in HDFS Write Operation Read Operation Write Operation Assume that you are writing a file into HDFS. Your write request goes to the Distributed File System (DFS) client, and the DFS makes an RPC call to the NameNode (NN) to create a new file. Before creating the file the NameNode will do a couple of things: it will check that the file does not already exist and that the user has permission to create a new file. Once all the checks are done successfully, the NameNode will provide a...
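A minimal sketch of the write-once, read-many pattern from the shell, assuming a hypothetical local file data.txt and an existing /dir_1 directory.

     $ hdfs dfs -put data.txt /dir_1/
     $ hdfs dfs -cat /dir_1/data.txt

The file is written once with -put and can then be read any number of times with -cat; HDFS does not let you edit the file content in place afterwards, which is exactly the streaming access pattern described above.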

HDFS Commands Part - I

Prerequisite Before starting with the Hadoop shell, you have to install Hadoop . File System Shell Most of the commands in the FS shell behave like the corresponding Linux commands. The FileSystem (FS) shell is invoked by bin/hadoop fs <args> . All the FS shell commands take path URIs as arguments. For HDFS the scheme is hdfs , and for the local filesystem the scheme is file . The scheme and authority are optional; if not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /dir_1/dir_2 can be specified as hdfs://<namenodehost>/dir_1/dir_2 or simply as /dir_1/dir_2 (given that your configuration is set to point to hdfs://<namenodehost> ). Error information is sent to stderr and the output is sent to stdout . Basic Commands 1. version This command prints the Hadoop version.
     Example: hdfs version
2. cat This HDFS command is used to display the conten...
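A small sketch of the scheme-and-authority rule above, assuming your configuration points to hdfs://<namenodehost>; the three listings show the fully qualified form, the short form, and the file scheme for the local filesystem.

     $ hdfs dfs -ls hdfs://<namenodehost>/dir_1/dir_2
     $ hdfs dfs -ls /dir_1/dir_2
     $ hdfs dfs -ls file:///home/user

The first two commands list the same HDFS directory; the third lists a local directory through the same shell.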

Install Hadoop On Ubuntu

Prerequisite Before installing Hadoop, you have to install Java . Hadoop Installation Steps Step 1: Create a Separate Login
          $ sudo addgroup hadoop
          $ sudo adduser --ingroup hadoop hdfsuser
          $ sudo adduser hdfsuser sudo
Step 2: Install SSH
          $ sudo apt-get update
          $ sudo apt-get install ssh
          $ sudo su hdfsuser
          $ ssh-keygen -t rsa -P ""
               >> If it asks for a file name or location, leave it blank.
          $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
          ...
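A quick check, not from the post, that the passwordless SSH setup above works before continuing with the installation; it assumes you are still logged in as hdfsuser.

          $ ssh localhost
          $ exit

The ssh step should log in without prompting for a password; if it still asks for one, re-check the permissions on ~/.ssh and ~/.ssh/authorized_keys.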