Create hive external table from partitioned parquet files in Azure HDInsight



After you create the partitioned table, run the following to register the existing directories as partitions:

MSCK REPAIR TABLE table_name;
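For reference, the table definition this assumes could look something like the following sketch. The table name, columns, and storage path are hypothetical placeholders; the partition columns must match the directory naming on disk (e.g. year=2016/month=10/day=11/hour=11):

```sql
-- Hypothetical partitioned external table over Parquet files;
-- LOCATION points at the root of the partition directories.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 'wasbs://container@account.blob.core.windows.net/data/table_name';

-- Scan LOCATION and register any directories following the
-- year=.../month=.../day=.../hour=... convention as partitions.
MSCK REPAIR TABLE table_name;
```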

If you have a large number of partitions, you might need to set hive.msck.repair.batch.size.

When there are a large number of untracked partitions, MSCK REPAIR TABLE can be run batch-wise to avoid an OOME (Out of Memory Error). If a batch size is configured via the property hive.msck.repair.batch.size, it runs the repair in batches internally. The default value of the property is zero, which means all partitions are processed at once.
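A minimal sketch of applying this for a session (the value 3000 is an arbitrary example; choose one that fits your metastore and heap):

```sql
-- Process untracked partitions in batches of 3000 instead of all at once.
SET hive.msck.repair.batch.size=3000;
MSCK REPAIR TABLE table_name;
```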

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)


Written by the OP:

This will probably fix your issue; however, if the data is very large, it won't work. See the relevant issue here.

As a workaround, there is another way to add partitions to Hive metastore one by one like:

alter table table_name add partition(year=2016, month=10, day=11, hour=11);

We wrote a simple script to automate this alter statement, and it seems to work for now.
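The OP's script is not shown; as one alternative, Hive also accepts several PARTITION clauses in a single ALTER TABLE statement, which reduces the number of metastore round trips (the partition values below are illustrative):

```sql
-- Register several partitions in one statement
-- instead of issuing one ALTER per partition.
ALTER TABLE table_name ADD IF NOT EXISTS
  PARTITION (year=2016, month=10, day=11, hour=11)
  PARTITION (year=2016, month=10, day=11, hour=12)
  PARTITION (year=2016, month=10, day=11, hour=13);
```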