Remove Empty Partitions from Spark RDD
There is no built-in way to simply delete the empty partitions from an RDD. coalesce doesn't guarantee that the empty partitions will be removed. By default it merges existing partitions without a shuffle, so if you have an RDD with 40 empty partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
The repartition method performs a full shuffle and spreads the data roughly evenly over the requested number of partitions, so empty partitions generally disappear as long as there are enough records to go around. If you have an RDD with 50 empty partitions and 10 partitions with data and run rdd.repartition(20), the data will be split across the 20 partitions.