Which option to choose for writing CSV file in Spark (HDFS)?
I suggest writing the output to a new directory on HDFS - in case of a processing failure you will always be able to discard whatever was processed and relaunch the job from scratch against the original data - it's safe and easy. :)
When the processing is complete, just delete the old directory and rename the new one to the old name.
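The delete-then-rename swap described above can be sketched generically - here in plain Python against a local filesystem, with illustrative directory names mirroring the HDFS paths. On HDFS you would do the same two steps through the FileSystem API instead:

```python
import shutil
import tempfile
from pathlib import Path

def swap_output(tmp_dir: Path, final_dir: Path) -> None:
    """Replace final_dir with tmp_dir: delete the old output, rename the new one."""
    if final_dir.exists():
        shutil.rmtree(final_dir)   # analogous to hdfs.delete(path, recursive=True)
    tmp_dir.rename(final_dir)      # analogous to hdfs.rename(tmp, final)

# Demo on a throwaway directory; names are illustrative, not real HDFS paths.
base = Path(tempfile.mkdtemp())
tmp = base / "ingestion_tmp"
tmp.mkdir()
(tmp / "part-0000.csv").write_text("a,b\n1,2\n")   # "new" output

final = base / "ingestion"
final.mkdir()
(final / "old.csv").write_text("stale\n")          # "old" output

swap_output(tmp, final)
print((final / "part-0000.csv").exists())  # True  - new output is in place
print((final / "old.csv").exists())        # False - old output is gone
```

The point of the pattern is that the original data is never touched until a complete replacement exists, so a mid-job failure leaves you with the old output intact.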
UPDATE:
First write the new output to a temporary directory:

```scala
deleted_duplicate.write
  .format("csv")
  .mode("overwrite")
  .save("hdfs://localhost:8020/data/ingestion_tmp/")
```

Then delete the old directory and rename the temporary one in its place (note that `delete` and `rename` take `Path` objects, and `delete` needs `recursive = true` for a directory):

```java
Configuration conf = new Configuration();
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

FileSystem hdfs = FileSystem.get(URI.create("hdfs://<namenode-hostname>:<port>"), conf);

// recursively delete the old output directory
hdfs.delete(new Path("hdfs://localhost:8020/data/ingestion"), true);
// move the new output into its place
hdfs.rename(new Path("hdfs://localhost:8020/data/ingestion_tmp"),
            new Path("hdfs://localhost:8020/data/ingestion"));
```
Here is a link to the HDFS FileSystem API docs