Saturday, May 28, 2022

Theory Of Bucketing

 Theory of Hive Bucketing 

Bucketing in Hive, also called clustering, is a data-organizing technique. It splits a table's data into a fixed number of more manageable files, known as buckets, and you specify that number when you create the Hive table. Hive hashes the value of the bucketing column and takes the result modulo the user-defined number of buckets to decide which bucket each row goes to. Bucketing is similar to partitioning in that both divide large datasets into more manageable parts, but bucketing always produces a fixed number of files rather than one directory per distinct value. So, we can use bucketing in Hive when partitioning becomes difficult, for example when the column has too many distinct values for each to get its own partition.

With the help of bucketing in Hive, you can decompose a table's data set into smaller parts, making them easier
to handle. Bucketing groups rows whose bucketing-column values hash to the same bucket and writes them to a single file, which can improve
performance when joining tables or reading data. This is a big reason why bucketing is often used together with partitioning.
The concept of bucketing is based on the hashing technique.
The modulus of the hash of the current column value and the number of required buckets is calculated (let's say, F(x) % 3).
Based on the resulting value, the row is stored in the corresponding bucket.
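The hashing steps above can be sketched in a few lines of Python. This is only an illustration, not Hive's actual implementation: the names hash_value, bucket_for and assign_buckets are hypothetical, and the sketch assumes that an integer column value hashes to itself, with the sign bit masked off before taking the modulus.

```python
from collections import defaultdict

NUM_BUCKETS = 3  # matches the F(x) % 3 example in the text


def hash_value(x: int) -> int:
    # Assumption: an integer column value hashes to itself (stand-in for F(x)).
    return x


def bucket_for(x: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Mask off the sign bit so a negative hash still maps to a valid bucket,
    # then take the modulus with the number of buckets.
    return (hash_value(x) & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids, num_buckets: int = NUM_BUCKETS) -> dict:
    # Group every id under the bucket number it hashes to.
    buckets = defaultdict(list)
    for i in ids:
        buckets[bucket_for(i, num_buckets)].append(i)
    return dict(buckets)


if __name__ == "__main__":
    # Ids 1..7 spread over three buckets.
    print(assign_buckets([1, 2, 3, 4, 5, 6, 7]))
```

With three buckets, ids 3 and 6 go to bucket 0, ids 1, 4 and 7 to bucket 1, and ids 2 and 5 to bucket 2. Rows with the same Id always end up in the same bucket, which is what lets a join on the bucketing column compare matching buckets only.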
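Enable bucketing enforcement by using the following command (needed on Hive 1.x; from Hive 2.0 onward the property was removed and bucketing is always enforced):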
hive> set hive.enforce.bucketing = true;  
Create a bucketed table by using the following command:
hive> create table emp_bucket(Id int, Name string , Salary float)    
clustered by (Id) into 3 buckets  
row format delimited    
fields terminated by ',' ;
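Create a dummy (staging) table to store the raw data. LOAD DATA only copies files into the table directory and does not hash rows into buckets, so the data is loaded into this plain table first.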
hive> create table emp_demo (Id int, Name string , Salary float)    
row format delimited    
fields terminated by ',' ; 

Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;  
Now, insert the data from the dummy table into the bucketed table. Hive hashes each row's Id during this insert and writes the row to the matching bucket file.
hive> insert overwrite table emp_bucket select * from emp_demo;    
