Impala is an MPP (massively parallel processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster.
Impala was developed to overcome the slowness of Hive queries; it was originally provided as a separate tool by the Cloudera distribution.
It is open-source software written in C++ and Java, and it provides higher performance and lower latency than other SQL engines for Hadoop.
Impala uses massively parallel processing to run very fast queries against data stored in HDFS, HBase, etc.
It offers high-performance, low-latency SQL queries. Impala is a good option when we are dealing with medium-sized datasets and expect near-real-time responses to our queries. Note, however, that Impala is available only as part of a Hadoop distribution.
It can read almost all the file formats used by Hadoop, such as Parquet, Avro, and RCFile.
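For example, a table can be created in one of these formats directly from Impala and then queried like any other table (the table and column names below are just illustrative):

```sql
-- Create a table stored as Parquet and query it like any other table
CREATE TABLE events_parquet (id BIGINT, name STRING)
STORED AS PARQUET;

SELECT COUNT(*) FROM events_parquet;
```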
Impala is not based on MapReduce algorithms, unlike Apache Hive.
Hence, Impala is faster than Apache Hive, since it avoids the latency of launching MapReduce jobs.
There are three daemons in Impala. They are as follows:
Impala StateStore: The Impala StateStore is installed on one host of the cluster. It checks on the health of the Impala daemons on all the DataNodes.
We can say the StateStore daemon is a name service that monitors the availability of Impala services across the cluster.
It also handles situations such as nodes becoming unavailable or available again. The StateStore keeps track of which impalads are up and running and relays this information to all the impalads in the cluster, so they can take it into account when distributing tasks to other impalads.
Impala Catalog Server: The Impala Catalog Server is installed on one host of the cluster. Via the StateStore, it distributes metadata changes to the Impala daemons.
It is physically represented by a daemon process named catalogd. You only need such a process on one host in the cluster.
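When metadata changes outside Impala (for example, when files are added or a table is altered through Hive), the catalog must be told to reload it. The statements below illustrate this; the table name my_table is a placeholder:

```sql
-- Reload file and block metadata for one table after new data files
-- were added outside Impala
REFRESH my_table;

-- Discard and reload all cached metadata for a table whose definition
-- changed in Hive (a heavier operation than REFRESH)
INVALIDATE METADATA my_table;
```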
Impala Daemon (impalad):
There is one of these daemons per node; it is installed on every DataNode. They form the core of the Impala execution engine and are the ones reading data from HDFS/HBase and aggregating/processing it.
This daemon uses the Hive metastore to obtain the mapping between tables and files.
It also uses the HDFS NameNode to get the mapping between files and blocks. Therefore, to locate and process data, Impala relies on both the Hive metastore and the NameNode.
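You can see this table-to-file mapping yourself from impala-shell; again, my_table is a placeholder name:

```sql
-- List the HDFS files backing a table, as resolved via the metastore
SHOW FILES IN my_table;

-- Per-partition row counts, file counts, and sizes
SHOW TABLE STATS my_table;
```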
We can say all impalads are equivalent: any of them can accept queries from tools such as the impala-shell command, Hue, JDBC, or ODBC.
There are three major components of an impalad:
Query Planner: The Query Planner is responsible for parsing the query and producing an execution plan. This planning occurs in two parts:
1) First, a single-node plan is made, as if all the data in the cluster resided on just one node.
2) Afterwards, this single-node plan is converted into a distributed plan based on the location of the various data sources in the cluster (thereby leveraging data locality).
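You can inspect the distributed plan the planner produces by prefixing a query with EXPLAIN (table and column names here are placeholders):

```sql
-- Show the distributed plan, including scan fragments, aggregations,
-- and the exchanges that move data between nodes
EXPLAIN
SELECT col, COUNT(*)
FROM my_table
GROUP BY col;
```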
Coordinator: The Query Coordinator is responsible for coordinating the execution of the entire query. It sends requests to the various executors to read and process data, then receives the data back from these executors and streams it to the client via JDBC/ODBC.
Executor: The Executor is responsible for reading and aggregating data. The data is read locally where possible; if it is not available locally, it can be streamed from the executors of other Impala daemons.
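After running a query in impala-shell, you can see how the work was split across executors with the shell's SUMMARY and PROFILE commands (the query itself uses placeholder names):

```sql
SELECT col, COUNT(*) FROM my_table GROUP BY col;

-- impala-shell command: per-operator summary (scan, aggregate, exchange)
-- for the last query, including how many hosts ran each fragment
SUMMARY;

-- impala-shell command: full runtime profile of the last query
PROFILE;
```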