Thursday, June 30, 2022

Performance tuning in Hive

 




There are several Hive optimization techniques we can apply when running Hive queries to improve their performance.


Tez Execution Engine: Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data-processing tasks. We can consider it a much more flexible and powerful successor to the MapReduce framework.
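Assuming Tez is installed and available on the cluster, it can be enabled per session before running a query. A minimal sketch (the `employee` table is hypothetical):

```sql
-- Switch the execution engine from the default MapReduce to Tez
set hive.execution.engine=tez;

-- Subsequent queries in this session now run as Tez DAGs
select count(*) from employee;
```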


Usage of Suitable File Format in Hive:

ORC File Format – if we use an appropriate file format based on the data, it will drastically increase our query performance.

For increasing query performance, the ORC file format is best suited. ORC stands for Optimized Row Columnar, which means we can store data in a more optimized way than with the other file formats. ORC can reduce the size of the original data by up to 75%, and data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance.
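A table can be stored in ORC at creation time. A minimal sketch (table and column names are hypothetical; the `orc.compress` property is optional):

```sql
-- Store the table in ORC format with Snappy compression
create table employee_orc (id int, name string, salary float)
stored as orc
tblproperties ("orc.compress"="SNAPPY");

-- Populate it from an existing text-format table
insert overwrite table employee_orc select * from employee;
```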


Hive Partitioning:

With partitioning, the entries for the different values of the partition columns are segregated and stored in their respective partitions. When we write a query to fetch values from the table, only the required partitions of the table are read, which reduces the time taken by the query to yield its result.
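As a sketch (table and column names are hypothetical), a table partitioned by country keeps each country's rows in a separate HDFS directory; the settings below let Hive create the partitions dynamically from the data on insert:

```sql
create table employee_part (id int, name string, salary float)
partitioned by (country string);

-- Allow partitions to be created from the data itself
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the select list
insert overwrite table employee_part partition (country)
select id, name, salary, country from employee;
```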


Bucketing in Hive :

Bucketing in Hive is the concept of breaking data down into ranges, known as buckets, to give extra structure to the data so it can be used for more efficient queries. The range for a bucket is determined by the hash value of one or more columns in the dataset (or Hive metastore table).

Enable bucketing by using the following command:

hive> set hive.enforce.bucketing = true;


Create a bucketed table by using the following command:

hive> create table employee_bucket (Id int, Name string, Salary float)
clustered by (Id) into 3 buckets
row format delimited
fields terminated by ',';
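Data lands in the correct buckets when it is inserted through a query rather than loaded as raw files. A sketch (`employee` is a hypothetical source table with matching columns):

```sql
-- With hive.enforce.bucketing=true, Hive launches one reducer per bucket,
-- assigning each row to a bucket by hash(Id) % 3
insert overwrite table employee_bucket
select Id, Name, Salary from employee;
```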



Vectorization in Hive:

To improve the performance of operations such as scans, aggregations, filters, and joins, we can use vectorized query execution. Instead of processing a single row at a time, these operations are performed on batches of 1024 rows at once, which greatly reduces the per-row overhead of our Hive queries.

We can set the parameters below to bring in more parallelism, which significantly improves query execution time:

set hive.vectorized.execution.enabled=true; 
set hive.exec.parallel=true;


For example:

select x.*, y.*
from (select * from firsttable) x
join (select * from secondtable) y
on x.id = y.id;


Cost-Based Optimization in Hive:

Before submitting a query for final execution, Hive optimizes its logical and physical execution plan. The cost-based optimizer (CBO), a recent addition to Hive, performs further optimizations based on query cost. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others. The statistics that drive these decisions are gathered by ANALYZE statements or collected in the metastore itself, ultimately cutting down on query execution time and reducing resource utilization.


To use CBO, set the following parameter at the beginning of your query:

set hive.cbo.enable=true;
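CBO can only make good decisions when statistics exist. The statements below gather table- and column-level statistics (the table name is hypothetical), and the extra settings let the optimizer actually use them:

```sql
-- Gather table-level statistics (row count, raw data size)
analyze table employee compute statistics;

-- Gather column-level statistics for the optimizer
analyze table employee compute statistics for columns;

-- Let Hive answer simple queries and cost plans from statistics
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
```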


Hive Indexing: Indexing creates a separate table, called an index table, which acts as a reference for the original table.

Without indexing, queries that touch only some columns take a large amount of time, because they are executed against all the rows present in the table.

The major advantage of indexing is that when we query a table that has an index, there is no need to scan all the rows: Hive checks the index first and then goes to the particular column data to perform the operation.

With indexes maintained, it is easier for a Hive query to look into the index first and then perform the needed operations in less time.
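A compact index can be created as sketched below (table and index names are hypothetical). Note that built-in indexing applies to older releases, as it was removed in Hive 3.0:

```sql
-- Create a compact index on the id column (Hive 2.x and earlier)
create index employee_id_idx
on table employee (id)
as 'COMPACT'
with deferred rebuild;

-- Build (or refresh) the index data
alter index employee_id_idx on employee rebuild;
```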


 YouTube link: Hadoop series 1 - YouTube


 Follow 👉 syed ashraf quadri👈 for awesome stuff



Hive Architecture


Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.


Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.


Hive Server - It is also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive Driver.

HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive and allows multiple clients to submit requests and retrieve the final results. HiveServer1 could not handle concurrent requests from more than one client, which is why it was replaced by HiveServer2.


It is basically designed to provide the best support for open API clients like JDBC and ODBC.



Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.


Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.


Hive Optimizer : Optimizer generates the logical plan in the form of DAG of map-reduce tasks and HDFS tasks.


Hive Execution Engine: The execution engine executes the incoming tasks in the order of their dependencies.


Hive MetaStore - It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes metadata of columns and their type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.

This metastore is generally a relational database.
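The structure information held in the metastore for a table can be inspected from the Hive shell (the table name is hypothetical):

```sql
-- Shows columns, types, storage format, SerDe and HDFS location
describe formatted employee;
```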


Wednesday, June 29, 2022

Impala Architecture

 


Impala Daemon:


The Impala daemon is generally identified by the impalad process. It runs on every node in the CDH cluster. It accepts queries from various interfaces like the impala-shell, the Hue browser, etc., and processes them. Whenever any Impala node in the cluster creates, alters, or drops any object, or any statement like INSERT or LOAD DATA is processed, each daemon also receives the broadcast message.


Whenever a query is submitted to an impalad on a particular node, that node serves as a “coordinator node” for that query

To store the mapping between tables and files, this daemon uses the Hive metastore.

It also uses the HDFS NameNode to get the mapping between files and blocks. Therefore, to get and process data, Impala uses the Hive metastore and the NameNode.

We can say all ImpalaDs are equivalent.

There are three major components of an ImpalaD:



Query Planner: The Query Planner is responsible for parsing the query. This planning occurs in two parts:

1) First, a single-node plan is made, as if all the data in the cluster resided on just one node.

2) Afterwards, based on the location of the various data sources in the cluster, this single-node plan is converted to a distributed plan (thereby leveraging data locality).



Coordinator: The Query Coordinator is responsible for coordinating the execution of the entire query. To read and process data, it sends requests to various executors. Afterwards, it receives the data back from these executors and streams it back to the client via JDBC/ODBC.

 

Executor: The Executor is responsible for aggregations of data. The data is read locally or, if not available locally, streamed from the executors of other Impala daemons.



Impala Catalog Server: The Impala Catalog Server is installed on one host of the cluster. Via the statestore, it distributes metadata to the Impala daemons.

It is physically represented by a daemon process named catalogd. You only need such a process on one host in a cluster.
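When table metadata changes outside Impala (for example through Hive), the catalog can be told to pick up the change with standard Impala statements. A sketch (the table name is hypothetical):

```sql
-- Reload metadata for all tables (expensive; use after external DDL)
invalidate metadata;

-- Cheaper: reload file and block metadata for a single table
refresh employee;
```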



Impala Statestore: The Impala statestore daemon process is named statestored.

The Impala statestore is installed on one host of the cluster. The statestore checks on the health of the Impala daemons on all the DataNodes.

We can say the statestore daemon is a name service that monitors the availability of Impala services across the cluster.

It also handles situations such as nodes becoming unavailable or becoming available again. The Impala statestore keeps track of which ImpalaDs are up and running, and relays this information to all the ImpalaDs in the cluster, so they are aware of it when distributing tasks to other ImpalaDs.

In the event of a node failure for any reason, the statestore updates all other nodes about the failure, and once such a notification is available to the other impalads, no other Impala daemon assigns any further queries to the affected node.