Which option to choose for writing CSV file in Spark (HDFS)?

I suggest writing the output to a new directory on HDFS. If processing fails, you can always discard whatever was written and rerun the job from scratch against the original data. It's safe and easy. :)

When processing is complete, just delete the old directory and rename the new one to the old name.


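    // Spark job: write the result to a temporary directory first
    deleted_duplicate.write
      .format("csv")
      .mode("overwrite")
      .save("hdfs://localhost:8020/data/ingestion_tmp/")

Once the write has succeeded, swap the directories with the Hadoop FileSystem API (Java):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://<namenode-hostname>:<port>"), conf);

    // delete() and rename() take Path objects, not plain strings
    boolean isRecursive = true; // the old output is a directory, so delete it recursively
    hdfs.delete(new Path("hdfs://localhost:8020/data/ingestion"), isRecursive);
    hdfs.rename(new Path("hdfs://localhost:8020/data/ingestion_tmp"),
                new Path("hdfs://localhost:8020/data/ingestion"));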
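
If you want to try the delete-then-rename swap without a cluster, here is a minimal sketch of the same pattern against the local filesystem using plain `java.nio.file`. It is not HDFS code: the class name `DirectorySwap`, the methods `swap` and `deleteRecursively`, and the demo directory names are all made up for illustration, and on HDFS the corresponding steps are the `hdfs.delete` and `hdfs.rename` calls. The one extra thing it shows is a guard that refuses to delete the old data when the new output directory is missing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, local filesystem only (an illustration, not the Hadoop API).
class DirectorySwap {

    // Recursively delete a directory tree; does nothing if it does not exist.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
    }

    // Replace `target` with `tmp`. Refuses to touch `target` if `tmp` is missing,
    // for example because processing failed before anything was written.
    static void swap(Path tmp, Path target) throws IOException {
        if (!Files.isDirectory(tmp)) {
            throw new IOException("New output not found, keeping old data: " + tmp);
        }
        deleteRecursively(target);
        Files.move(tmp, target);
    }

    public static void main(String[] args) throws IOException {
        // Demo layout in a throwaway temp directory (names are assumptions).
        Path base = Files.createTempDirectory("ingestion_demo");
        Path target = Files.createDirectory(base.resolve("ingestion"));
        Path tmp = Files.createDirectory(base.resolve("ingestion_tmp"));
        Files.writeString(target.resolve("part-0.csv"), "old\n");
        Files.writeString(tmp.resolve("part-0.csv"), "new\n");

        swap(tmp, target);
        System.out.print(Files.readString(target.resolve("part-0.csv")));
    }
}
```

Note that between the delete and the rename there is a short window in which the target directory does not exist, so readers of that path should be paused or able to retry while the swap runs.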

See the HDFS FileSystem API docs for details on `delete` and `rename`.