Cloud Stable: Theory Of Bucketing

Theory of Hive Bucketing

Bucketing in Hive is a data-organizing technique. Bucketing (also called clustering) splits a table's data into a fixed number of more manageable files; you specify the number of buckets when you create the Hive table. The value of the bucketing column is hashed, and the hash modulo the user-defined number of buckets decides which bucket a row goes to. It is similar to partitioning in Hive, except that it divides a large dataset into a fixed number of parts, known as buckets, instead of one directory per distinct column value. So we can use bucketing in Hive when partitioning becomes difficult to implement, for example when the column has too many distinct values.

With the help of bucketing in Hive, you can decompose a table's data set into smaller parts, making it easier to handle. Bucketing groups rows whose bucketing-column values hash to the same bucket and writes them to one single file, which improves performance when joining tables or reading data. This is a big reason why we use bucketing together with partitioning.
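To make the combination of partitioning and bucketing concrete, here is a minimal sketch of a table that uses both. The table name emp_part_bucket and the Dept partition column are hypothetical and only serve as illustration; the other columns mirror the emp_bucket table used later in this post.

```sql
-- Hypothetical table: one directory per Dept value (partitioning),
-- and inside each directory the rows are hashed on Id into 3 files (bucketing)
create table emp_part_bucket (Id int, Name string, Salary float)
partitioned by (Dept string)
clustered by (Id) into 3 buckets
row format delimited
fields terminated by ',';
```

With a layout like this, a query that filters on Dept only has to read one partition directory, and a join on Id can work bucket by bucket within it.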

The concept of bucketing is based on the hashing technique.

The modulus of the hash of the current column value and the number of required buckets is calculated (let's say, F(x) % 3).

Now, based on the resulting value, the row is stored in the corresponding bucket.
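The F(x) % 3 idea can be tried directly in Hive. This is a minimal sketch, assuming the built-in hash() and pmod() functions stand in for F and %, and assuming a Hive version that accepts a select without a from clause. The hash that Hive uses internally to place rows in bucket files can differ between versions, so treat the result as an illustration of the idea rather than the exact file assignment. The Id values 101, 102 and 103 are made up.

```sql
-- hash() plays the role of F(x); pmod(..., 3) maps the hash onto buckets 0, 1 and 2
select pmod(hash(101), 3) as bucket_for_101,
       pmod(hash(102), 3) as bucket_for_102,
       pmod(hash(103), 3) as bucket_for_103;
```

Rows whose Id values yield the same number would land in the same bucket.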

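Enable bucketing by using the following command (needed on Hive versions before 2.0; from 2.0 onwards the property was removed and bucketing is always enforced):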

hive> set hive.enforce.bucketing = true;

Create a bucketed table by using the following command:

hive> create table emp_bucket (Id int, Name string, Salary float)
    > clustered by (Id) into 3 buckets
    > row format delimited
    > fields terminated by ',';

Create a dummy table to store the data.

hive> create table emp_demo (Id int, Name string, Salary float)
    > row format delimited
    > fields terminated by ',';

Now, load the data into the dummy table.

hive> load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;

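Now, insert the data from the dummy table into the bucketed table. A plain load data only moves the file into place and does not hash the rows into buckets, so the bucketed table has to be populated with an insert ... select.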

hive> insert overwrite table emp_bucket select * from emp_demo;
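Once the insert has finished, the result can be checked from the same Hive session. The commands below are a sketch: the warehouse path /user/hive/warehouse/emp_bucket is an assumption based on the default warehouse location, so use the Location value reported by describe formatted if your setup differs.

```sql
-- Show the table metadata; look for the "Num Buckets" and "Bucket Columns" rows
describe formatted emp_bucket;

-- List the files in the table directory (assumed default path);
-- a table clustered into 3 buckets should show three data files
dfs -ls /user/hive/warehouse/emp_bucket;

-- Read only the first of the three buckets
select * from emp_bucket tablesample(bucket 1 out of 3 on Id);
```

Sampling a single bucket like this is one of the practical payoffs of bucketing, since Hive can read one file instead of scanning the whole table.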


Saturday, May 28, 2022
