What is Apache Hive

August 30, 2020

What is Apache Hive?

Hadoop is like a sea of tools and technologies, each of which helps get the job done, and Hive is one of them. Hive runs on top of Hadoop. Apache Hive is a Hadoop component aimed mainly at data analysts. Even though Apache Pig can be used for the same purpose, Hive is used more by researchers and programmers. It is an open-source data warehousing system built for querying and analyzing large volumes of data stored in Hadoop HDFS.

Hive supports data querying, data summarization and data analysis. HiveQL is the query language in Hive. Hive translates SQL-like HiveQL queries into MapReduce jobs that run on Hadoop. Hive also provides a shell in which we can run the basic operations it supports. When we run a HiveQL query in the Hive shell, Hive typically launches MapReduce jobs internally and returns the result. Hive also offers schema flexibility and pluggable data serialization and deserialization (SerDe).
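
To make the "query becomes a MapReduce job" idea concrete, here is a minimal sketch in plain Python of how a simple GROUP BY query could be split into map, shuffle and reduce steps. It only imitates the idea on in-memory data; it is not how Hive itself is implemented, and the table page_views, its columns and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, country).
# The table and its data are assumptions for this example.
ROWS = [
    ("u1", "IN"),
    ("u2", "US"),
    ("u3", "IN"),
    ("u4", "DE"),
    ("u5", "US"),
    ("u6", "IN"),
]

# The HiveQL query being imitated:
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user_id, country in rows:
        yield country, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate (here COUNT) to each group of values."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases, roughly what one MapReduce job does."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for country, count in sorted(run_query(ROWS).items()):
        print(country, count)
```

In real Hive the map and reduce steps run in parallel across the cluster, which is why the same query keeps working as the table grows.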

Advantages of Apache Hive

Apache Hive works well with large data sets, and it makes analysis over them easier.
Querying in Apache Hive is simple because HiveQL is very similar to SQL.
Hive is good for ETL workloads on Hadoop.
Hive handles the ad-hoc queries needed for data analysis.
Hive is scalable, because the work is spread across the Hadoop cluster.
Custom functions (UDFs) can be written in Java, and Python scripts can be plugged into queries through the TRANSFORM clause.
Hive ships with many built-in string functions that are widely used for analysis.
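
One common way to use Python for custom logic in Hive is the TRANSFORM clause, which streams rows through an external script as tab-separated lines on standard input and reads the results from standard output. Below is a minimal sketch of such a script. The file name normalize.py, the table customers and the columns name and city are assumptions made up for this example.

```python
import sys

# Hypothetical HiveQL that would call this script (names are assumptions):
#   ADD FILE normalize.py;
#   SELECT TRANSFORM(name, city)
#          USING 'python normalize.py'
#          AS (name, city)
#   FROM customers;


def normalize(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    name, city = line.rstrip("\n").split("\t")
    return f"{name.strip().title()}\t{city.strip().upper()}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write the cleaned rows to stdout."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(normalize(line) + "\n")


if __name__ == "__main__":
    main()
```

The script can be tried without Hive by piping a tab-separated file into it, which makes this kind of custom logic easy to test before running it on the cluster.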

Disadvantages of Apache Hive

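Apache Hive isn't designed for online transaction processing (OLTP) and doesn't support it.
Subquery support is limited: early versions accepted subqueries only in the FROM clause, and IN/EXISTS subqueries in WHERE were added in Hive 0.13.
Updating data can be awkward: Hive is built for bulk loads and appends, and row-level updates need transactional tables (Hive 0.14 and later).
Ad-hoc queries can be slow, since even a small query may start a batch job.
Transaction support is limited: early versions had none, and the later ACID support applies only to tables declared as transactional.
It can't process data in real time.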
