overwrite hive partitions using spark
If you are on Spark 2.3.0 or later, try setting spark.sql.sources.partitionOverwriteMode to dynamic. The target table needs to be partitioned, and the write mode must be overwrite.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
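One pitfall with insertInto: it matches DataFrame columns to the table's columns by position, not by name, and the partition columns must come last. A minimal sketch of a helper that computes the right select order before the write (the table and column names are illustrative assumptions, not from the question):

```scala
// Compute a column order that lines up positionally with the target table:
// data columns first, partition columns last. insertInto matches by position,
// so a mismatched order silently writes data into the wrong columns.
def insertOrder(dataCols: Seq[String], partitionCols: Seq[String]): Seq[String] =
  dataCols.filterNot(partitionCols.contains) ++ partitionCols

// Usage against a hypothetical table partitioned by (country, day):
// import org.apache.spark.sql.functions.col
// val ordered = df.select(insertOrder(df.columns, Seq("country", "day")).map(col): _*)
// ordered.write.mode("overwrite").insertInto("partitioned_table")
```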
I would suggest running SQL through the SparkSession instead: you can run an INSERT OVERWRITE ... PARTITION query that selects the columns from the existing dataset. This solution overwrites only the matching partitions.
So, if you are on a Spark version earlier than 2.3 and want to write into partitions dynamically without deleting the others, you can use the solution below.
The idea is to register the dataset as a temporary view and then use spark.sql() to run the INSERT query.
import org.apache.spark.sql.SparkSession

// Create SparkSession with Hive dynamic partitioning enabled
val spark: SparkSession = SparkSession
  .builder()
  .appName("StatsAnalyzer")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// Register the dataframe as a temporary view
impressionsDF.createOrReplaceTempView("impressions_dataframe")

// Create the output Hive table
spark.sql(
  s"""
     |CREATE EXTERNAL TABLE stats (
     |  ad STRING,
     |  impressions INT,
     |  clicks INT
     |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
     |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
   """.stripMargin)

// Write the data into disk as Hive partitions
spark.sql(
  s"""
     |INSERT OVERWRITE TABLE stats
     |PARTITION(country = 'US', year = 2017, month = 3, day)
     |SELECT ad, SUM(impressions), SUM(clicks), day
     |FROM impressions_dataframe
     |GROUP BY ad, day
   """.stripMargin)