Which option to choose for writing CSV file in Spark (HDFS)?
I suggest writing the output to a new directory on HDFS - in case of a processing failure you will always be able to discard whatever was processed and relaunch the job from scratch against the original data - it's safe and easy. :)
When the processing is complete, just delete the old directory and rename the new one to the old name.
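The delete-then-rename swap described above can be sketched generically - here in plain Python against a local filesystem, with illustrative directory names mirroring the HDFS paths. On HDFS you would do the same two steps through the FileSystem API instead:

```python
import shutil
import tempfile
from pathlib import Path

def swap_output(tmp_dir: Path, final_dir: Path) -> None:
    """Replace final_dir with tmp_dir: delete the old output, rename the new one."""
    if final_dir.exists():
        shutil.rmtree(final_dir)   # analogous to hdfs.delete(path, recursive=True)
    tmp_dir.rename(final_dir)      # analogous to hdfs.rename(tmp, final)

# Demo on a throwaway directory; names are illustrative, not real HDFS paths.
base = Path(tempfile.mkdtemp())
tmp = base / "ingestion_tmp"
tmp.mkdir()
(tmp / "part-0000.csv").write_text("a,b\n1,2\n")   # "new" output

final = base / "ingestion"
final.mkdir()
(final / "old.csv").write_text("stale\n")          # "old" output

swap_output(tmp, final)
print((final / "part-0000.csv").exists())  # True  - new output is in place
print((final / "old.csv").exists())        # False - old output is gone
```

The point of the pattern is that the original data is never touched until a complete replacement exists, so a mid-job failure leaves you with the old output intact.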
UPDATE:
First write the new output to a temporary directory:

```scala
deleted_duplicate.write
  .format("csv")
  .mode("overwrite")
  .save("hdfs://localhost:8020/data/ingestion_tmp/")
```

Then delete the old directory and rename the temporary one in its place (note that `delete` and `rename` take `Path` objects, and `delete` needs `recursive = true` for a directory):

```java
Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

FileSystem hdfs = FileSystem.get(URI.create("hdfs://<namenode-hostname>:<port>"), conf);

// recursively delete the old output directory
hdfs.delete(new Path("hdfs://localhost:8020/data/ingestion"), true);
// move the new output into its place
hdfs.rename(new Path("hdfs://localhost:8020/data/ingestion_tmp"),
            new Path("hdfs://localhost:8020/data/ingestion"));
```
Here is a link to the HDFS FileSystem API docs