overwrite hive partitions using spark


If you are on Spark 2.3.0, try setting spark.sql.sources.partitionOverwriteMode to dynamic. The write must target a partitioned table, and the save mode must be overwrite.

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
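For illustration, a minimal end-to-end sketch of that approach, assuming a Hive table partitioned_table that already exists and is partitioned by a string column date (the table name and sample data here are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DynamicOverwriteExample")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

import spark.implicits._

// Hypothetical update: new rows for the date=2018-01-02 partition only.
val data = Seq(("b", "2018-01-02")).toDF("value", "date")

// insertInto matches columns by position, so the partition column (date)
// must come last, exactly as in the table definition. With dynamic
// overwrite mode, only the partitions present in `data` are replaced;
// all other partitions are left untouched.
data.write.mode("overwrite").insertInto("partitioned_table")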


I would suggest running the SQL directly through the SparkSession. You can run an INSERT OVERWRITE ... PARTITION query, selecting the columns from your existing dataset. This solution overwrites only the targeted partition; a sketch follows.
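A minimal sketch of this suggestion, assuming a pre-existing Hive table event_table partitioned by dt (all names and sample data here are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OverwriteSinglePartition")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

// Hypothetical dataset; dt is the partition column.
val eventsDF = Seq((1L, "click", "2020-01-01")).toDF("id", "event", "dt")
eventsDF.createOrReplaceTempView("events")

// With a fully static partition spec, the partition column is not
// selected. Only partition dt='2020-01-01' of event_table is replaced.
spark.sql(
  """
    |INSERT OVERWRITE TABLE event_table
    |PARTITION (dt = '2020-01-01')
    |SELECT id, event
    |FROM events
    |WHERE dt = '2020-01-01'
  """.stripMargin)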


So, if you are using a Spark version < 2.3 and want to write into partitions dynamically without deleting the others, you can use the solution below.

The idea is to register the dataset as a table and then use spark.sql() to run the INSERT query.

import org.apache.spark.sql.SparkSession

// Create SparkSession with Hive dynamic partitioning enabled
val spark: SparkSession =
  SparkSession
    .builder()
    .appName("StatsAnalyzer")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()

// Register the dataframe (an existing DataFrame of raw impressions) as a Hive table
impressionsDF.createOrReplaceTempView("impressions_dataframe")

// Create the output Hive table
spark.sql(
  s"""
    |CREATE EXTERNAL TABLE stats (
    |   ad            STRING,
    |   impressions   INT,
    |   clicks        INT
    |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  """.stripMargin)

// Write the data into disk as Hive partitions
spark.sql(
  s"""
    |INSERT OVERWRITE TABLE stats
    |PARTITION(country = 'US', year = 2017, month = 3, day)
    |SELECT ad, SUM(impressions), SUM(clicks), day
    |FROM impressions_dataframe
    |GROUP BY ad, day
  """.stripMargin)
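Note how country, year and month are given statically in the PARTITION clause while day is left dynamic: Hive takes the day value from the last column of the SELECT, so only the day partitions that appear in the aggregated result (under country='US', year=2017, month=3) are overwritten, and every other partition is preserved.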