
Spark output filename and append on write


1) There is no direct support in the saveAsTextFile method to control the output file name. You can try using saveAsHadoopDataset to control the output file basename.

e.g.: instead of part-00000 you can get yourCustomName-00000.

Keep in mind that you cannot control the 00000 suffix with this method. Spark assigns it automatically while writing so that each partition writes to a unique file.

To control that part too, as mentioned in the comments above, you have to write your own custom OutputFormat (see the sketch after the snippet below).

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

SparkConf conf = new SparkConf();
conf.setMaster("local").setAppName("yello");
JavaSparkContext sc = new JavaSparkContext(conf);

JobConf jobConf = new JobConf();
// Basename for each output file: produces customName-00000, customName-00001, ...
jobConf.set("mapreduce.output.basename", "customName");
jobConf.set("mapred.output.dir", "outputPath");
// saveAsHadoopDataset requires the key, value and output format classes to be set
jobConf.setOutputKeyClass(NullWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);

// saveAsHadoopDataset is only available on pair RDDs, so wrap each line in a tuple
JavaRDD<String> input = sc.textFile("inputDir");
JavaPairRDD<NullWritable, Text> pairs =
    input.mapToPair(line -> new Tuple2<>(NullWritable.get(), new Text(line)));
pairs.saveAsHadoopDataset(jobConf);
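
If you need to rewrite the whole file name (including the partition suffix), the OutputFormat itself is the place to do it. Here is a minimal sketch using the old mapred API; the class name and the "myFile-" prefix are placeholders, not something Spark provides:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;

// Hypothetical custom OutputFormat that renames each partition's file before
// delegating to TextOutputFormat. "name" arrives as something like part-00000;
// keep something partition-specific in it so each partition still writes a unique file.
public class CustomNameOutputFormat extends TextOutputFormat<NullWritable, Text> {
    @Override
    public RecordWriter<NullWritable, Text> getRecordWriter(
            FileSystem ignored, JobConf job, String name, Progressable progress)
            throws IOException {
        return super.getRecordWriter(ignored, job, "myFile-" + name, progress);
    }
}

You would then plug it in with jobConf.setOutputFormat(CustomNameOutputFormat.class) instead of TextOutputFormat in the snippet above.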

2) A workaround would be to write the output as-is to your output location and then use Hadoop's FileUtil.copyMerge function to form a single merged file.
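
A minimal sketch of that merge step, assuming Hadoop 2.x (FileUtil.copyMerge was removed in Hadoop 3); the paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);

// Concatenate every part-xxxxx file in the output directory into one file
FileUtil.copyMerge(
    fs, new Path("outputPath"),          // directory written by Spark
    fs, new Path("merged/output.txt"),   // single destination file
    false,                               // keep the original part files
    hadoopConf,
    null);                               // no string appended between parts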