
Hive Table partition with column in the middle


Let's take an example: you want a partitioned Hive table with four columns (id INT, fname STRING, dt STRING, lname STRING), where id, fname, and lname are regular columns storing an integer id, a string first name, and a string last name respectively, and dt is a partition column of type string containing a date in yyyy-MM-dd format. To create such a table you would issue a command like:

CREATE EXTERNAL TABLE my_table (id INT, fname STRING, lname STRING)
PARTITIONED BY (dt STRING)
LOCATION '/usr/hive/warehouse/my_table';

When you insert data into this table (via an INSERT OVERWRITE command, say) and check the HDFS location (/usr/hive/warehouse/my_table), you will find that the data is stored in directories, one directory per partition. The name of each directory would be something like dt=2012-01-01 or dt=2012-02-22. Inside these directories is your actual data, in whatever storage format you selected. The partition column is not stored with this data; it is a virtual column whose value is derived from the partition directory the data resides in.
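For illustration, the layout under the table location might look something like this (hypothetical partition values and data file names; the actual names depend on your data and output format):

/usr/hive/warehouse/my_table/dt=2012-01-01/000000_0
/usr/hive/warehouse/my_table/dt=2012-02-22/000000_0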

Now let's get to your question. Since the partitioning column is a virtual column, you cannot put a partitioned Hive table directly on top of your data as it is (regardless of whether the to-be-partitioning column sits in the middle of the file or at the end). For partitioning to work, the appropriate directory structure must exist in HDFS. Instead, you would create a staging table that is not partitioned:

CREATE EXTERNAL TABLE my_table_staging (id INT, fname STRING, dt STRING, lname STRING)
LOCATION '/usr/hive/warehouse/my_table_staging';

Then use this staging table as the source to populate your partitioned table using dynamic partitioning, with a command like the one below:

INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT id, fname, lname, dt FROM my_table_staging;
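Depending on your Hive version and configuration, dynamic partitioning may need to be enabled in the session before running this insert; the settings commonly involved are:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;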

This command will read the data from your staging table and insert it into the partitioned table, creating the appropriate directory structure for you in HDFS.
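As a quick sanity check, you can list the partitions Hive now knows about; each row returned should correspond to one dt=... directory under the table location:

SHOW PARTITIONS my_table;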

References:
- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
- https://cwiki.apache.org/Hive/dynamicpartitions.html