Cloud Stable: Performance tunning in hive

***Performance tunning in hive ***

There are several Hive optimization techniques to improve its performance which we can implement when we run our hive queries

Tez Execution Engine : Tez Execution Engine is a new application framework built on Hadoop Yarn That executes complex-directed acyclic graphs of general data processing tasks. However, we can consider it to be a much more flexible and powerful successor to the map-reduce framework

Usage of Suitable File Format in Hive :

ORCFILE File Formate – Hive Optimization Techniques, if we use appropriate file format on the basis of data. It will drastically increase our query performance.

for increasing your query performance ORC file format is best suitable.ORC refers to Optimized Row Columnar that implies we can store data in an optimized way than the other file formats.ORC reduces the size of the original data up to 75%. data processing speed also increases.On comparing to Text, Sequence and RC file formats, ORC shows better performance.

Hive Partitioning:

By Partitioning all the entries for the various columns of the dataset are segregated and stored in their respective partition.While we write the query to fetch the values from the table, only the required partitions of the table are queried. it reduces the time taken by the query to yield the result.

Bucketing in Hive :

Bucketing in hive is the concept of breaking data down into ranges, which are known as buckets, to give extra structure to the data so it may be used for more efficient queries. The range for a bucket is determined by the hash value of one or more columns in the dataset (or Hive metastore table)

Enable the bucketing by using the following command:

1. hive> set hive.enforce.bucketing = true;

Create a bucketing table by using the following command

hive> create table employee_bucket(Id int, Name string , Salary float)

clustered by (Id) into 3 buckets

row format delimited

fields terminated by ',' ;

Vectorization In Hive :

in order to improve the performance of operations we use Vectorized query execution. Here operations refer to scans, aggregations, filters, and joins. It happens by performing them in batches of 1024 rows at once instead of single row each time. in a very layman's language, we can improve the performance of aggregations, filters, and joins of our hive queries by using vectorized query execution, which means scanning them in batches of 1024 rows at once instead of single row each time.

We can set below parameters which will help to bring in more parallelism and which significantly improves query execution time

set hive.vectorized.execution.enabled=true; 
set hive.exec.parallel=true;

For example:

Select x.*, y.* from
(select * from firsttable ) x
Join
(select * from secondtable ) y
On x.id=y.id
;

Cost-Based Optimization in Hive:

Before submitting for final execution Hive optimizes each Query’s logical and physical execution plan. CBO, performs, further optimizations based on query cost in a recent addition to Hive. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism and others. These decisions are collected by ANALYZE statements or the metastore itself, ultimately cutting down on query execution time and reducing resource utilization

To use CBO, set the following parameters at the beginning of your query

set hive.cbo.enable=true;

Hive Indexing : for the original table use of indexing will create a separate called index table which acts as a reference.

it will take a large amount of time if we want to perform queries only on some columns without indexing.

Because queries will be executed on all the columns present in the table.there is no need for the query to scan all the rows in the table while we perform a query on a table that has an index, it turned out as the major advantage of using indexing. it checks the index first and then goes to the particular column and performs the operation.

maintaining indexes will be easier for Hive query to look into the indexes first and then perform the needed operations within less amount of time.

YouTube link: Hadoop series 1 - YouTube

Follow 👉 syed ashraf quadri👈 for awesome stuff

Cloud Stable

Thursday, June 30, 2022

Performance tunning in hive

No comments:

Post a Comment