
How to write to CSV in Spark


Since Spark uses the Hadoop FileSystem API to write data to files, this is sort of inevitable. If you do

rdd.saveAsTextFile("foo")

It will be saved as "foo/part-XXXXX", with one part-* file for every partition in the RDD you are trying to save. The reason each partition in the RDD is written to a separate file is fault-tolerance: if the task writing the 3rd partition (i.e. to part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, with no effect on the other parts. If they all wrote to the same file, it would be much harder to recover from a single task failure.
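As a minimal sketch of this behavior (assuming an existing SparkContext named sc), the number of part files matches the number of partitions:

// 4 partitions in, so Spark writes foo/part-00000 .. foo/part-00003
val rdd = sc.parallelize(1 to 100, 4)
rdd.saveAsTextFile("foo")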

The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or other Hadoop-based frameworks, because they all use the HDFS API: if you ask them to read "foo", they will read all the part-XXXXX files inside foo as well.
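For example, continuing the sketch above, reading the directory back picks up every part file:

// reads foo/part-00000 .. foo/part-00003 as a single RDD of lines
val roundTripped = sc.textFile("foo")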


I suggest doing it this way (Java example):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// shuffle everything into a single partition, then write it out
theRddToPrint.coalesce(1, true).saveAsTextFile(textFileName);

// merge the part files under textFileName into one destination file
FileSystem fs = anyUtilClass.getHadoopFileSystem(rootFolder);
FileUtil.copyMerge(
    fs, new Path(textFileName),
    fs, new Path(textFileNameDestiny),
    true, fs.getConf(), null);
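Note that FileUtil.copyMerge concatenates every file under textFileName into the single file textFileNameDestiny, and the true argument deletes the source directory once the merge succeeds. anyUtilClass.getHadoopFileSystem here is a placeholder for however you obtain the Hadoop FileSystem in your own code.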


Extending Tathagata Das's answer to Spark 2.x and Scala 2.11

Using Spark SQL we can do this in a one-liner:

// implicits for magic functions like .toDF
import spark.implicits._

val df = Seq(
  ("first", 2.0),
  ("choose", 7.0),
  ("test", 1.5)
).toDF("name", "vals")

// write the DataFrame/Dataset to external storage
df.write
  .format("csv")
  .save("csv/file/location")
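If you also want a header row and a single output file, a sketch like the following should work (the output path is just an example):

// coalesce to one partition so only one CSV file is produced,
// and ask the CSV writer to emit a header row
df.coalesce(1)
  .write
  .option("header", "true")
  .format("csv")
  .save("csv/file/location-with-header")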

Then you can go ahead and proceed with adoalonso's answer.