Impala
Impala
is an MPP (Massively Parallel Processing) SQL query engine for processing huge
volumes of data stored in a Hadoop cluster. It is open-source software
written in C++ and Java. It provides high performance and low latency
compared to other SQL engines for Hadoop. Impala is one of the highest-performing SQL
engines (giving an RDBMS-like experience) and provides a very fast way to access
data stored in the Hadoop Distributed File System (HDFS).
Move over, SQL engines claiming to do parallel
processing: Impala's open-source Massively Parallel Processing (MPP) SQL
engine is here, armed with all the power to push you aside. The only condition
it needs is that the data be stored in a cluster of computers running Apache Hadoop,
which, given Hadoop's dominance in data warehousing, isn't uncommon. Cloudera Impala
was announced on the world stage in October 2012 and, after a successful
beta run, was made available to the general public in May 2013.
Impala Architecture
The three
important daemons of Impala are as follows:
Impala StateStore: The Impala StateStore is
installed on one host of the cluster. The StateStore checks on the
health of the Impala daemons on all the DataNodes. It is physically represented by
a daemon process named statestored. We can
say the StateStore daemon is a name service that monitors the availability of
Impala services across the cluster. It also
handles situations such as nodes becoming unavailable or becoming available
again. The Impala StateStore keeps track of which impalads are up and running, and
relays this information to all the impalads in the cluster, so they are
aware of it when distributing tasks to other impalads.
In the event of a node failure for
any reason, the StateStore updates all other nodes about this failure, and
once such a notification is
available to the other impalads, no other Impala daemon
assigns any further queries to the affected node.
Impala Catalog Server: The Impala Catalog Server is installed on one host of the
cluster. It distributes metadata to the Impala
daemons via the StateStore. It is physically represented by a daemon process named catalogd. You only need such a process on one host in a
cluster. Usually, the statestored and catalogd processes run on the same host,
as catalog updates are passed through the statestored. The catalog service avoids
the need to issue REFRESH and INVALIDATE METADATA statements when the
metadata changes are performed by statements issued through Impala. The Impala
component known as the Catalog Service relays the metadata changes from Impala
SQL statements to all the Impala daemons in a cluster. Most considerations for
load balancing and high availability apply to the impalad daemon. The statestored and catalogd
daemons do not have special requirements for high availability, because
problems with those daemons do not result in data loss. If those daemons become
unavailable due to an outage on a particular host, you can stop the Impala
service, delete the Impala StateStore and Impala Catalog Server
roles, add the roles on a different host, and restart the Impala service.
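The catalog service only covers metadata changes made through Impala; changes made outside Impala (for example through Hive or direct HDFS file operations) still call for a manual refresh. A minimal sketch, with hypothetical table names:
-- data files were added outside Impala (e.g., by Hive or hdfs commands):
REFRESH sales_db.transactions;
-- a table was created or dropped outside Impala, or metadata is badly out of sync:
INVALIDATE METADATA sales_db.new_table;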
Impala
Daemon (impalad):
There is one of these daemons per node;
it is installed on every DataNode. The Impala daemons form the core of the
Impala execution engine and are the ones reading data from HDFS/HBase and
aggregating/processing it. The daemon accepts queries from various interfaces such as
the impala-shell, the Hue browser, etc., and processes them. Whenever a query is
submitted to an impalad on a particular node, that node serves as the "coordinator node" for
that query. Multiple queries are served by the impalads running on other nodes as
well. The core Impala component is the Impala daemon, physically
represented by the impalad
process. After accepting a query, the impalad reads and writes data
files and parallelizes the query by distributing the work to the other Impala
nodes in the cluster. When query fragments are processing on the various impalad
instances, all of them return their results to the central coordinating node.
After
receiving the query, the query coordinator verifies whether the query is
appropriate, using the table schema from the Hive metastore. It then collects the information about the
location of the data required to execute the query from the HDFS NameNode
and sends this information to the other impalads in order to execute the query. Depending on the requirement, queries can be submitted to a
dedicated impalad or in a load-balanced manner to any impalad in your cluster.
To store the mapping between tables and files, this daemon uses the Hive metastore. All
the other Impala daemons read the specified data blocks and process the query.
As soon as all the daemons complete their tasks, the query coordinator collects
the results and delivers them to the user.
It also uses the HDFS NameNode to get the mapping between files and blocks.
Therefore, to get and process the data, Impala uses the Hive metastore and
the NameNode. We can say all impalads are equivalent. This daemon accepts queries
from several tools such as the impala-shell command, Hue, JDBC, or ODBC. The
Impala daemons are in constant communication with the StateStore, to
confirm which daemons are healthy and can accept new work. They also receive
broadcast messages from the catalogd daemon whenever any Impala daemon in the
cluster creates, alters, or drops any type of object, or when an INSERT or LOAD
DATA statement is processed through Impala. This background communication minimizes
the need for the REFRESH or INVALIDATE METADATA
statements that were needed to coordinate metadata across Impala daemons prior
to Impala
1.2.
There
are three major components of an impalad:
Query Planner: The Query Planner is responsible for parsing
the query. Planning occurs in two parts:
1)
First, a single-node plan is made, as if all the data in the cluster
resided on just one node.
2) Afterwards, on the basis of the location of
the various data sources in the cluster, this single-node plan is converted into a
distributed plan (thereby leveraging data locality).
Coordinator: The Query Coordinator is responsible for
coordinating the execution of the entire query. To read and process data, it
sends requests to the various executors. Afterwards, it receives the data back from
these executors and streams it back to the client via JDBC/ODBC.
Executor: The Executor is responsible for aggregating data; in particular,
data that is read locally or, if not available locally, streamed from
the executors of other Impala daemons.
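As an illustration of the coordinator role, a query can be submitted to any impalad in the cluster, and the host that receives it coordinates the work. A minimal sketch (hostname, port, and table name are assumptions):
impala-shell -i impalad-host-3:21000 -q "SELECT count(*) FROM sales_db.orders;"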
Why Impala
Use Impala for the following reasons:
Do BI-style Queries on Hadoop
Impala provides low latency
and high concurrency for BI/analytic queries on Hadoop (not delivered by batch
frameworks such as Apache Hive). Impala also scales linearly, even in
multitenant environments.
Unify Your Infrastructure
Utilize the same file and data
formats and metadata, security, and resource management frameworks as your
Hadoop deployment—no redundant infrastructure or data conversion/duplication.
Implement Quickly
For Apache Hive users, Impala
utilizes the same metadata and ODBC driver. Like Hive, Impala supports SQL, so
you don't have to worry about re-inventing the implementation wheel.
Count on Enterprise-class Security
Impala is integrated with
native Hadoop security and Kerberos for authentication, and via the Sentry
module, you can ensure that the right users and applications are authorized for
the right data.
Retain Freedom from Lock-in
Impala is open source (Apache
License).
Expand the Hadoop User-verse
With Impala, more users,
whether using SQL queries or BI applications, can interact with more data
through a single repository and metadata store from source through analysis.
Impala combines the SQL
support and multi-user performance of a traditional analytic database with the
scalability and flexibility of Apache Hadoop, by utilizing standard components
such as HDFS, HBase, the Metastore, YARN, and Sentry. With Impala, users can
communicate with HDFS or HBase using SQL queries in a faster way compared to
other SQL engines like Hive. Impala can read almost all the file formats used by
Hadoop, such as Parquet, Avro, and RCFile. Impala uses the same metadata, SQL
syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache
Hive, providing a familiar and unified platform for batch-oriented or real-time
queries. Unlike Apache Hive, Impala is
not based on MapReduce algorithms. It implements a distributed architecture
based on daemon processes that are responsible for all aspects of query
execution and that run on the same machines. Thus, it avoids the latency of
MapReduce, and this makes Impala faster than Apache Hive.
How Impala Uses HDFS
Impala uses the
distributed filesystem HDFS as its primary data storage medium. Impala relies
on the redundancy provided by HDFS to guard against hardware or network outages
on individual nodes. Impala table data is physically represented as data files
in HDFS, using familiar HDFS file formats and compression codecs. When data
files are present in the directory for a new table, Impala reads them all,
regardless of file name. New data is added in files with names controlled by Impala.
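For example, an external table can be pointed at an existing HDFS directory, and Impala will read every data file it finds there regardless of file name. A minimal sketch with a hypothetical path and schema:
CREATE EXTERNAL TABLE web_logs (
  log_time TIMESTAMP,
  url      STRING,
  status   INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/web_logs';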
How Impala Uses HBase
HBase is an alternative to
HDFS as a storage medium for Impala data. It is a database storage system built
on top of HDFS, without built-in SQL support. Many Hadoop users already have it
configured and store large (often sparse) data sets in it. By defining tables
in Impala and mapping them to equivalent tables in HBase, you can query the
contents of the HBase tables through Impala, and even perform join queries
including both Impala and HBase tables. You can use Impala to query HBase
tables. This is useful for accessing any of your existing HBase tables via SQL
and performing analytics over them. HDFS and Kudu tables are preferred over
HBase for analytic workloads and offer superior performance. Kudu supports
efficient inserts, updates and deletes of small numbers of rows and can replace
HBase for most analytics-oriented use cases.
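A rough sketch of the mapping step described above. The HBase-backed table is typically defined through the Hive shell using the HBase storage handler, after which Impala can query and join it (table, column family, and column names are hypothetical):
-- in the Hive shell:
CREATE EXTERNAL TABLE hbase_customers (
  row_key STRING,
  name    STRING,
  city    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:city")
TBLPROPERTIES ("hbase.table.name" = "customers");

-- then in impala-shell, pick up the new definition and query it:
INVALIDATE METADATA hbase_customers;
SELECT c.name, o.total
FROM hbase_customers c JOIN orders o ON c.row_key = o.customer_id;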
Using Impala to Query Kudu Tables
You can use Impala to query
tables stored by Apache Kudu. This capability allows convenient access to a storage
system that is tuned for different kinds of workloads than the default with
Impala. By default, Impala tables are stored on HDFS using data files with
various file formats. HDFS files are ideal for bulk loads (append operations)
and queries using full-table scans, but do not support in-place updates or
deletes. Kudu is an alternative storage engine used by Impala which can do both
in-place updates (for mixed read/write workloads) and fast scans (for
data-warehouse/analytic operations). Using Kudu tables with Impala can simplify
the ETL pipeline by avoiding extra steps to segregate and reorganize newly
arrived data. Certain Impala SQL statements and clauses, such as DELETE,
UPDATE, UPSERT, and PRIMARY KEY work only with Kudu tables. Other statements and
clauses, such as LOAD DATA, TRUNCATE TABLE, and INSERT OVERWRITE, are not
applicable to Kudu tables.
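A brief sketch of the Kudu-only statements mentioned above, using a hypothetical table:
CREATE TABLE kudu_events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING,
  PRIMARY KEY (event_id)
)
PARTITION BY HASH (event_id) PARTITIONS 4
STORED AS KUDU;

UPSERT INTO kudu_events VALUES (1, 42, 'login');   -- insert, or overwrite the row with this key
UPDATE kudu_events SET payload = 'logout' WHERE event_id = 1;
DELETE FROM kudu_events WHERE event_id = 1;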
Advantages of
Impala
Here is a list of some noted advantages of
Cloudera Impala.
Ø Using Impala, you can
process data stored in HDFS at lightning-fast speed with traditional
SQL knowledge.
Ø Since the data processing
is carried out where the data resides (on the Hadoop cluster), data transformation and
data movement are not required for data stored on Hadoop while working with
Impala.
Ø Using Impala, you can access
the data stored in HDFS, HBase, and Amazon S3 without knowledge of
Java (MapReduce jobs). You can access it with a basic idea of SQL queries.
Ø To write queries in
business tools, data normally has to go through a complicated extract-transform-load
(ETL) cycle. With Impala, this procedure is shortened. The time-consuming
stages of loading and reorganizing can be skipped, and new techniques such
as exploratory data analysis and data discovery make
the process faster.
Ø Impala is pioneering the
use of the Parquet file format, a columnar storage layout that is optimized for
large-scale queries typical in data warehouse scenarios.
Ø Fast Speed
Ø No need to Move Data
Ø Easy Access
Ø Short Procedure
Ø File Format
Ø Big Data
Ø Relational model
Ø Languages
Ø Familiar
Ø Distributed
Ø Faster Access
Ø High Performance
Disadvantages of Impala
Ø No support for SerDe
Ø No support for custom binary files
Ø Need to refresh metadata when data changes outside Impala
Ø No support for triggers
Ø No updates
Ø No transactions
Ø No indexing
Data Types Supported by Impala
Impala supports the standard numeric, character, and Boolean data types:
Ø Tinyint
Ø Smallint
Ø Int
Ø Bigint
Ø Float
Ø Boolean
Ø String
Ø Varchar
Ø Char
Ø Double
Ø Real
Ø Decimal
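For illustration, a hypothetical table declaration exercising several of these types:
CREATE TABLE type_demo (
  id       BIGINT,
  qty      INT,
  price    DECIMAL(9,2),
  ratio    DOUBLE,
  in_stock BOOLEAN,
  code     CHAR(4),
  label    VARCHAR(64),
  note     STRING
);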
Impala
Features
There are several features of
Impala; let's discuss them one by one.
Ø Open Source
Impala is freely available as open source under the Apache license.
Ø In-memory Processing
Cloudera Impala supports in-memory data processing:
it accesses and analyzes data stored on Hadoop DataNodes without any
data movement. Holding the data being processed in memory makes
query execution faster.
Ø Easy Data Access
Using SQL-like
queries, we can easily access data with Impala. Moreover, Impala offers common
data access interfaces. These include:
i. JDBC driver.
ii. ODBC driver.
Ø Faster Access
Compared to
other SQL engines, Impala offers faster access to the data in HDFS.
Ø Storage Systems
We can easily store data in
storage systems such as HDFS, Apache HBase, and Amazon S3.
i. HDFS file formats: delimited text files, Parquet,
Avro, SequenceFile, and RCFile.
ii. Compression codecs: Snappy, GZIP, Deflate,
BZIP.
Ø Easy Integration
It is possible to integrate
Impala with business intelligence tools such as Tableau, Pentaho,
MicroStrategy, and Zoomdata.
Ø File Formats
Impala supports several file formats,
such as LZO, SequenceFile, Avro, RCFile, and Parquet.
Ø Drivers from Hive
One advantage is that Impala
reuses Hive's metadata, ODBC driver, and
SQL syntax.
Ø Joins and Functions
Impala offers the most common SQL-92 features of Hive
Query Language (HiveQL), including SELECT, joins, and
aggregate functions.
Ø Developed
Cloudera Impala is
written in the C++ and Java languages.
Ø Relational Model
One of the major points is that
Impala follows the relational model.
Ø Data Model
Impala's data model is
schema-based in nature.
Ø APIs
When it comes to APIs, Impala
offers JDBC and ODBC APIs.
Ø Language Support
It supports all
languages that support JDBC/ODBC.
Ø High Performance
Compared to other SQL engines, Impala
offers high performance and low latency for Hadoop.
Ø Query UI
It supports Hue
Beeswax and the Cloudera Impala Query UI.
Ø CLI
It supports the impala-shell
command-line interface.
Ø Authentication
It offers Kerberos
authentication.
Ø Faster Access:
It is comparatively much faster
for SQL queries and data processing.
Ø Storage System:
It has the capability to access
HDFS and HBase as its storage systems.
Ø Secured:
It offers Kerberos authentication,
making it secure.
Ø Multi-API Support:
It supports multiple APIs, which
helps it connect to data sources.
Ø Easily Connected:
It is easy to connect to many
data visualization engines such as Tableau, etc.
Ø Built-in Functions:
Impala
comes with several built-in functions with which we can obtain the
results we need. Some of the built-in functions Impala supports are abs, concat,
power, adddate, date_add, dayofweek, nvl, and zeroifnull.
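A small sketch showing a few of these functions together (the orders table and its columns are assumptions):
SELECT concat('order-', cast(order_id AS STRING)) AS order_label,
       abs(balance)                               AS abs_balance,
       nvl(region, 'unknown')                     AS region,
       zeroifnull(discount)                       AS discount,
       dayofweek(order_date)                      AS day_of_week,
       adddate(order_date, 30)                    AS due_date
FROM orders;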
Hardware Requirements for Optimal
Join Performance
During join operations, portions of data from each joined table
are loaded into memory. Data sets can be very large, so ensure your hardware
has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following
is generally recommended:
CPU
Impala version 2.2 and higher uses the SSSE3 instruction set,
which is included in newer processors.
Note: This required level of processor is the same as in Impala version
1.x. The Impala 2.0 and 2.1 releases had a stricter requirement for the SSE4.1
instruction set, which has now been relaxed.
Memory
128 GB or more recommended,
ideally 256 GB or more. If the intermediate results
during query processing on a particular node exceed the amount of memory
available to Impala on that node, the query writes temporary work data to disk,
which can lead to long query times. Note that because the work is parallelized,
and intermediate results for aggregate queries are typically smaller than the
original data, Impala can query and join tables that are much larger than the
memory available on an individual node.
JVM Heap Size for Catalog Server
4 GB or more recommended, ideally 8 GB or
more, to accommodate the maximum numbers of tables, partitions, and data files
you are planning to use with Impala.
Storage
DataNodes
with 12 or more disks each. I/O speeds are often the limiting factor for disk
performance with Impala. Ensure that you have sufficient disk space to store
the data Impala will be querying.
User Account Requirements
Impala
creates and uses a user and group named impala.
Do not delete this account or group and do not modify the account's or group's
permissions and rights. Ensure no existing systems obstruct the functioning of
these accounts and groups. For example, if you have scripts that delete user
accounts not in a white-list, add these accounts to the list of permitted
accounts.
For
correct file deletion during DROP TABLE operations,
Impala must be able to move files to the HDFS trashcan. You might need to
create an HDFS directory /user/impala, writable by
the impala user, so
that the trashcan can be created. Otherwise, data files might remain behind
after a DROP TABLE statement.
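A sketch of creating that directory, assuming the commands are run as the HDFS superuser:
hdfs dfs -mkdir -p /user/impala
hdfs dfs -chown impala /user/impala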
Impala
should not run as root. Best Impala performance is achieved using direct reads,
but root is not permitted to use direct reads. Therefore, running Impala as
root negatively affects performance.
By
default, any user can connect to Impala and access all the associated databases
and tables. You can enable authorization and authentication based on the Linux
OS user who connects to the Impala server, and the associated groups for that
user. See Impala Security for details. These security features do not
change the underlying file permission requirements; the impala
user still needs to be able to access
the data files.
SQL Operations that Spill to
Disk
Certain memory-intensive
operations write temporary data to disk (known as spilling to disk) when Impala
is close to exceeding its memory limit on a particular host.
The result is a query that
completes successfully, rather than failing with an out-of-memory error. The
tradeoff is decreased performance due to the extra disk I/O to write the
temporary data and read it back in. The slowdown could potentially be
significant. Thus, while this feature improves reliability, you should optimize
your queries, system parameters, and hardware configuration to make this
spilling a rare occurrence.
What kinds of queries might spill to disk:
Several SQL clauses and constructs require memory allocations that
could activate the spilling mechanism:
Ø
When a query uses a GROUP BY clause for columns with
millions or billions of distinct values, Impala keeps a similar number of
temporary results in memory, to accumulate the aggregate results for each value
in the group.
Ø
When large tables are joined together, Impala keeps the values of
the join columns from one table in memory, to compare them to incoming values
from the other table.
Ø
When a large result set is sorted by the ORDER
BY clause, each node sorts its portion of the result set in memory.
Ø
The DISTINCT and UNION operators build
in-memory data structures to represent all values found so far, to eliminate
duplicates as the query progresses.
When the spill-to-disk feature is activated for a join node within
a query, Impala does not produce any runtime filters for that join operation on
that host. Other join nodes within the query are not affected.
How Impala handles scratch disk space for
spilling:
By default, intermediate files used during large sort, join,
aggregation, or analytic function operations are stored in the directory
/tmp/impala-scratch, and these intermediate files are removed when the
operation finishes. You can specify a different location by starting the impalad
daemon with the --scratch_dirs="path_to_directory" configuration
option.
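For example, a sketch of the startup flag with illustrative paths (multiple comma-separated directories spread the scratch I/O across disks):
impalad --scratch_dirs=/data1/impala-scratch,/data2/impala-scratch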
Memory usage for SQL operators:
In Impala 2.10 and higher, the way SQL operators
such as GROUP BY, DISTINCT, and joins transition between using additional
memory or activating the spill-to-disk feature is changed. The memory required
to spill to disk is reserved up front, and you can examine it in the EXPLAIN
plan when the EXPLAIN_LEVEL query option is set to 2 or higher.
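For instance, from impala-shell (the table and query are placeholders):
SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT customer_id, count(*) FROM orders GROUP BY customer_id;
-- the extended plan output includes the per-operator memory reservation estimates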
The infrastructure of the spilling feature
affects the way the affected SQL operators, such as GROUP BY, DISTINCT, and
joins, use memory. On each host that participates in the query, each such
operator in a query requires memory to store rows of data and other data
structures. Impala reserves a certain amount of memory up front for each
operator that supports spill-to-disk that is sufficient to execute the
operator. If an operator accumulates more data than can fit in the reserved
memory, it can either reserve more memory to continue processing data in memory
or start spilling data to temporary scratch files on disk. Thus, operators with
spill-to-disk support can adapt to different memory constraints by using
however much memory is available to speed up execution, yet tolerate low memory
conditions by spilling data to disk.
The amount of data depends on the portion of the
data being handled by that host, and thus the operator may end up consuming
different amounts of memory on different hosts.
Added in: This feature was added to the ORDER BY
clause in Impala 1.4. This feature was extended to cover join queries,
aggregation functions, and analytic functions in Impala 2.0. The size of the
memory work area required by each operator that spills was reduced from 512
megabytes to 256 megabytes in Impala 2.2. The spilling mechanism was reworked
to take advantage of the Impala buffer pool feature and be more predictable and
stable in Impala 2.10.
Testing performance implications of spilling to
disk:
To artificially provoke spilling, to test this
feature and understand the performance implications, use a test environment
with a memory limit of at least 2 GB. Issue the SET command with no arguments
to check the current setting for the MEM_LIMIT query option. Set the query option
DISABLE_UNSAFE_SPILLS=true. This option limits the spill-to-disk feature to
prevent runaway disk usage from queries that are known in advance to be
suboptimal. Within impala-shell, run a query that you expect to be
memory-intensive, based on the criteria explained earlier. A self-join of a
large table is a good candidate:
select count(*) from big_table a join
big_table b using (column_with_many_values);
Issue the PROFILE command to get a detailed
breakdown of the memory usage on each node during the query.
Set the MEM_LIMIT query option to a value that
is smaller than the peak memory usage reported in the profile output. Now try
the memory-intensive query again.
Check if the query fails with a message like
the following:
WARNINGS: Spilling has been disabled for plans
that do not have stats and are not hinted to prevent potentially bad plans from
using too many cluster resources. Compute stats on these tables, hint the plan
or disable this behavior via query options to enable spilling. If so, the query
could have consumed substantial temporary disk space, slowing down so much that
it would not complete in any reasonable time. Rather than rely on the
spill-to-disk feature in this case, issue the COMPUTE STATS statement for the
table or tables in your sample query. Then run the query again, check the peak
memory usage again in the PROFILE output, and adjust the memory limit again if
necessary to be lower than the peak memory usage.
At this point, you have a query that is
memory-intensive, but Impala can optimize it efficiently so that the memory
usage is not exorbitant. You have set an artificial constraint through the
MEM_LIMIT option so that the query would normally fail with an out-of-memory
error. But the automatic spill-to-disk feature means that the query should
actually succeed, at the expense of some extra disk I/O to read and write
temporary work data.
Try the query again, and confirm that it
succeeds. Examine the PROFILE output again. This time, look for lines of this
form:
- SpilledPartitions: N
If you see any such lines with N greater than
0, that indicates the query would have failed in Impala releases prior to 2.0,
but now it succeeded because of the spill-to-disk feature. Examine the total
time taken by the AGGREGATION_NODE or other query fragments containing non-zero
SpilledPartitions values. Compare the times to similar fragments that did not
spill, for example in the PROFILE output when the same query is run with a
higher memory limit. This gives you an idea of the performance penalty of the
spill operation for a particular query with a particular memory limit. If you
make the memory limit just a little lower than the peak memory usage, the query
only needs to write a small amount of temporary data to disk. The lower you set
the memory limit, the more temporary data is written and the slower the query
becomes.
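Pulling the steps above into one condensed impala-shell session (the table and the limit value are illustrative; the actual MEM_LIMIT should come from your own PROFILE output):
SET;                                    -- inspect current MEM_LIMIT and other options
SET DISABLE_UNSAFE_SPILLS=true;
SELECT count(*) FROM big_table a JOIN big_table b USING (column_with_many_values);
PROFILE;                                -- note the peak per-node memory usage
SET MEM_LIMIT=2gb;                      -- pick a value below the observed peak
SELECT count(*) FROM big_table a JOIN big_table b USING (column_with_many_values);
PROFILE;                                -- look for SpilledPartitions values greater than 0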
Now
repeat this procedure for actual queries used in your environment. Use the
DISABLE_UNSAFE_SPILLS setting to identify cases where queries used more memory
than necessary due to lack of statistics on the relevant tables and columns,
and issue COMPUTE STATS where necessary.
When to use DISABLE_UNSAFE_SPILLS:
You might wonder why you should not just leave
DISABLE_UNSAFE_SPILLS turned on all the time. Whether and how frequently to use
this option depends on your system environment and workload.
DISABLE_UNSAFE_SPILLS is suitable for an
environment with ad hoc queries whose performance characteristics and memory
usage are not known in advance. It prevents "worst-case scenario"
queries that use large amounts of memory unnecessarily. Thus, you might turn
this option on within a session while developing new SQL code, even though it
is turned off for existing applications.
Organizations where table and column
statistics are generally up-to-date might leave this option turned on all the
time, again to avoid worst-case scenarios for untested queries or if a problem
in the ETL pipeline results in a table with no statistics. Turning on
DISABLE_UNSAFE_SPILLS lets you "fail fast" in this case and immediately
gather statistics or tune the problematic queries.
Some
organizations might leave this option turned off. For example, you might have
tables large enough that COMPUTE STATS takes substantial time to run,
making it impractical to re-run after loading new data. If you have examined
the EXPLAIN plans of your queries and know that they are operating efficiently,
you might leave DISABLE_UNSAFE_SPILLS turned off. In that case, you know that
any queries that spill will not go overboard with their memory consumption.
Limits on Query Size and
Complexity
There are hardcoded limits on the maximum size
and complexity of queries. Currently, the maximum number of expressions in a
query is 2000. You might exceed the limits with large or deeply nested queries
produced by business intelligence tools or other query generators.
If you have the ability to customize such
queries or the query generation logic that produces them, replace sequences of
repetitive expressions with single operators such as IN or BETWEEN that can
represent multiple values or ranges. For example, instead of a large number of
OR clauses:
WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
use a single IN clause:
WHERE val IN (1,2,6,100,...)
Scalability Considerations for
File Handle Caching
One scalability aspect that affects heavily
loaded clusters is the load on the metadata layer from looking up the details
as each file is opened. On HDFS, that can lead to increased load on the
NameNode, and on S3, this can lead to an excessive number of S3 metadata
requests. For example, a query that does a full table scan on a partitioned
table may need to read thousands of partitions, each partition containing
multiple data files. Accessing each column of a Parquet file also involves a
separate "open" call, further increasing the load on the NameNode.
High NameNode overhead can add startup time (that is, increase latency) to
Impala queries, and reduce overall throughput for non-Impala workloads that
also require accessing HDFS files.
You can reduce the number of calls made to
your file system's metadata layer by enabling the file handle caching feature.
Data files that are accessed by different queries, or even multiple times
within the same query, can be accessed without a new "open" call and without
fetching the file details multiple times.
Impala supports file handle caching for the
following file systems:
HDFS in Impala 2.10 and higher
In Impala 3.2 and higher, file handle caching
also applies to remote HDFS file handles. This is controlled by the cache_remote_file_handles
flag for an impalad. It is recommended that you use the default value of true
as this caching prevents your NameNode from overloading when your cluster has
many remote HDFS reads.
S3 in Impala 3.3 and higher
The cache_s3_file_handles impalad flag
controls the S3 file handle caching. The feature is enabled by default with the
flag set to true.
The feature is enabled by default with 20,000
file handles to be cached. To change the value, set the configuration option
max_cached_file_handles to a non-zero value for each impalad daemon. From the
initial default value of 20000, adjust upward if NameNode request load is still
significant, or downward if it is more important to reduce the extra memory
usage on each host. Each cache entry consumes 6 KB, meaning that caching 20,000
file handles requires up to 120 MB on each Impala executor. The exact memory
usage varies depending on how many file handles have actually been cached;
memory is freed as file handles are evicted from the cache.
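A sketch of the relevant impalad startup flags; the values shown are assumptions to adapt to your own cluster:
impalad --max_cached_file_handles=40000 \
        --cache_remote_file_handles=true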
If a manual operation moves a file to the
trashcan while the file handle is cached, Impala still accesses the contents of
that file. This is a change from prior behavior. Previously, accessing a file
that was in the trashcan would cause an error. This behavior only applies to
non-Impala methods of removing files, not the Impala mechanisms such as
TRUNCATE TABLE or DROP TABLE.
If files are removed, replaced, or appended by
operations outside of Impala, the way to bring the file information up to date
is to run the REFRESH statement on the table.
File handle cache entries are evicted as the
cache fills up, or based on a timeout period when they have not been accessed
for some time.
To evaluate the effectiveness of file handle
caching for a particular workload, issue the PROFILE statement in impala-shell
or examine query profiles in the Impala Web UI. Look for the ratio of
CachedFileHandlesHitCount (ideally, should be high) to
CachedFileHandlesMissCount (ideally, should be low). Before starting any
evaluation, run several representative queries to "warm up" the cache
because the first time each data file is accessed is always recorded as a cache
miss.
To see metrics about file handle caching for
each impalad instance, examine the following fields on the /metrics page in the
Impala Web UI:
impala-server.io.mgr.cached-file-handles-miss-count
impala-server.io.mgr.num-cached-file-handles
Understanding Impala Query Performance -
EXPLAIN Plans and Query Profiles
To understand the high-level performance considerations for Impala
queries, read the output of the EXPLAIN statement for the query. You can get
the EXPLAIN plan without actually running the query itself.
For an overview of the physical performance characteristics for a
query, issue the SUMMARY statement in impala-shell immediately after executing
a query. This condensed information shows which phases of execution took the
most time, and how the estimates for memory usage and number of rows at each
phase compare to the actual values. To understand the detailed performance
characteristics for a query, issue the PROFILE statement in impala-shell
immediately after executing a query. This low-level information includes
physical details about memory, CPU, I/O, and network usage, and thus is only
available after the query is actually run.
Using the EXPLAIN Plan for Performance Tuning
The EXPLAIN statement gives you an outline of the logical steps
that a query will perform, such as how the work will be distributed among the
nodes and how intermediate results will be combined to produce the final result
set. You can see these details before actually running the query. You can use
this information to check that the query will not operate in some very
unexpected or inefficient way.
[impalad-host:21000] > explain select count(*) from customer_address;
+----------------------------------------------------------+
| Explain String                                            |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1  |
|                                                           |
| 03:AGGREGATE [MERGE FINALIZE]                             |
| |  output: sum(count(*))                                  |
| |                                                         |
| 02:EXCHANGE [PARTITION=UNPARTITIONED]                     |
| |                                                         |
| 01:AGGREGATE                                              |
| |  output: count(*)                                       |
| |                                                         |
| 00:SCAN HDFS [default.customer_address]                   |
|    partitions=1/1 size=5.25MB                             |
+----------------------------------------------------------+
Read the EXPLAIN plan from bottom to top:
The last part of the plan shows the low-level details such as the
expected amount of data that will be read, where you can judge the
effectiveness of your partitioning strategy and estimate how long it will take
to scan a table based on total data size and the size of the cluster.
As you work your way up, next you see the operations that will be
parallelized and performed on each Impala node.
At the higher levels, you see how data flows when intermediate
result sets are combined and transmitted from one node to another.
See EXPLAIN_LEVEL Query Option for details about the EXPLAIN_LEVEL
query option, which lets you customize how much detail to show in the EXPLAIN
plan depending on whether you are doing high-level or low-level tuning, dealing
with logical or physical aspects of the query.
The EXPLAIN plan is also printed at the beginning of the query
profile report described in Using the Query Profile for Performance Tuning, for
convenience in examining both the logical and physical aspects of the query
side-by-side.
The amount of detail displayed in the EXPLAIN output is controlled
by the EXPLAIN_LEVEL query option. You typically increase this setting from
standard to extended (or from 1 to 2) when double-checking the presence of table
and column statistics during performance tuning, or when estimating query
resource usage in conjunction with the resource management features in CDH 5.
Using the SUMMARY Report for Performance Tuning
The SUMMARY command within the impala-shell interpreter gives you
an easy-to-digest overview of the timings for the different phases of execution
for a query. Like the EXPLAIN plan, it makes it easy to see potential performance
bottlenecks. Like the PROFILE output, it is available after the query is run
and so displays actual timing numbers.
The SUMMARY report is also printed at the beginning of the query
profile report described in Using the Query Profile for Performance Tuning, for
convenience in examining high-level and low-level aspects of the query
side-by-side.
For example, here is a query involving an aggregate function, on a
single-node VM. The different stages of the query and their timings are shown
(rolled up for all nodes), along with estimated and actual values used in
planning the query. In this case, the AVG() function is computed for a subset
of data on each node (stage 01) and then the aggregated results from all nodes
are combined at the end (stage 03). You can see which stages took the most
time, and whether any estimates were substantially different than the actual
data distribution. (When examining the time values, be sure to consider the
suffixes such as us for microseconds and ms for milliseconds, rather than just
looking for the largest numbers.)
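A sketch of the shape of such a session (table and column names are hypothetical; the numbers in the report depend entirely on your data and cluster):
[impalad-host:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
[impalad-host:21000] > summary;
-- the report lists one row per operator (scan, aggregate, exchange, ...) with columns
-- such as #Hosts, Avg Time, Max Time, #Rows, Est. #Rows, Peak Mem, and Est. Peak Mem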