Which option to choose for writing CSV file in Spark (HDFS)?



I suggest writing the output to a new directory on HDFS: if processing fails, you can always discard whatever was processed and relaunch the processing from scratch with the original data. It's safe and easy. :)

When the processing is complete, just delete the old directory and rename the new one to the old directory's name.
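The swap is just a recursive delete followed by a rename. Here is a minimal local-filesystem sketch of the same pattern in shell (the directory names under /tmp are illustrative, not from the original answer; on a real cluster you would use hdfs dfs -rm -r and hdfs dfs -mv instead):

```shell
# Simulate the swap locally: "ingestion" is the live output, "ingestion_tmp" holds the fresh run.
mkdir -p /tmp/ingestion /tmp/ingestion_tmp
echo "new data" > /tmp/ingestion_tmp/part-00000.csv

rm -rf /tmp/ingestion                    # delete the old directory
mv /tmp/ingestion_tmp /tmp/ingestion     # rename the new one into its place

cat /tmp/ingestion/part-00000.csv        # prints: new data
```

Because the rename only happens after the new data is fully written, a failure at any earlier point leaves the old directory untouched.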

UPDATE:

First write the result to a temporary directory:

    deleted_duplicate.write
      .format("csv")
      .mode("overwrite")
      .save("hdfs://localhost:8020/data/ingestion_tmp/")

Then swap it into place with the Hadoop FileSystem API (note that delete and rename take Path objects, and the second argument of delete enables recursive deletion):

    Configuration conf = new Configuration();
    conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://<namenode-hostname>:<port>"), conf);

    boolean isRecursive = true;
    hdfs.delete(new Path("hdfs://localhost:8020/data/ingestion"), isRecursive);
    hdfs.rename(new Path("hdfs://localhost:8020/data/ingestion_tmp"),
                new Path("hdfs://localhost:8020/data/ingestion"));

Here is a link to the HDFS FileSystem API docs.