
Write to multiple outputs by key Spark - one Spark job


If you use Spark 1.4+, this has become much, much easier thanks to the DataFrame API. (DataFrames were introduced in Spark 1.3, but partitionBy(), which we need, was introduced in 1.4.)

If you're starting out with an RDD, you'll first need to convert it to a DataFrame:

val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, "charlie")))
val people_df = people_rdd.toDF("number", "name")

In Python, this same code is:

people_rdd = sc.parallelize([(1, "alice"), (1, "bob"), (2, "charlie")])
people_df = people_rdd.toDF(["number", "name"])

Once you have a DataFrame, writing to multiple outputs based on a particular key is simple. What's more -- and this is the beauty of the DataFrame API -- the code is pretty much the same across Python, Scala, Java and R:

people_df.write.partitionBy("number").text("people")

And you can easily use other output formats if you want:

people_df.write.partitionBy("number").json("people-json")
people_df.write.partitionBy("number").parquet("people-parquet")

In each of these examples, Spark will create a subdirectory for each distinct value of the key we've partitioned the DataFrame on:

people/
  _SUCCESS
  number=1/
    part-abcd
    part-efgh
  number=2/
    part-abcd
    part-efgh
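To make that layout concrete, here is a small plain-Python sketch (not Spark code) of the Hive-style `column=value` naming scheme that partitionBy uses. The single `part-00000` file name per directory is a simplifying assumption; Spark writes one part file per task.

```python
import os
import tempfile

def partition_path(base, column, value):
    """Build a Hive-style partition path like base/column=value."""
    return os.path.join(base, "{}={}".format(column, value))

# Simulate grouping rows into per-key directories the way partitionBy does.
rows = [(1, "alice"), (1, "bob"), (2, "charlie")]
base = tempfile.mkdtemp()
for number, name in rows:
    d = partition_path(base, "number", number)
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-00000"), "a") as f:
        f.write(name + "\n")

print(sorted(os.listdir(base)))  # ['number=1', 'number=2']
```

Readers of the partitioned output can recover the `number` column from the directory names alone, which is why the value itself does not need to be repeated inside the part files.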


I would do it like this, which is scalable:

import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}

I just saw a similar answer above, but we don't actually need custom partitioners. MultipleTextOutputFormat will create one file for each key, and it is fine for multiple records with the same key to fall into the same partition.

Use new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num high; that way each partition holds fewer distinct keys and does not open too many HDFS file handles at once.


If you potentially have many values for a given key, I think the scalable solution is to write out one file per key per partition. Unfortunately there is no built-in support for this in Spark, but we can whip something up.

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
  .mapPartitionsWithIndex { (p, it) =>
    val outputs = new MultiWriter(p.toString)
    for ((k, v) <- it) {
      outputs.write(k.toString, v)
    }
    outputs.close
    Nil.iterator
  }
  .foreach((x: Nothing) => ()) // To trigger the job.

// This one is local, but you could write one for HDFS.
class MultiWriter(suffix: String) {
  private val writers = collection.mutable.Map[String, java.io.PrintWriter]()

  def write(key: String, value: Any) = {
    if (!writers.contains(key)) {
      val f = new java.io.File("output/" + key + "/" + suffix)
      f.getParentFile.mkdirs
      writers(key) = new java.io.PrintWriter(f)
    }
    writers(key).println(value)
  }

  def close = writers.values.foreach(_.close)
}

(Replace PrintWriter with your choice of distributed filesystem operation.)

This makes a single pass over the RDD and performs no shuffle. It gives you one directory per key, containing one file per partition that held records for that key.
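For comparison, the same one-file-per-key-per-partition idea can be sketched in plain Python. This is a local-filesystem illustration of the MultiWriter pattern above (the class and directory names are just for the sketch), simulating how one partition of the example RDD would be written:

```python
import os
import tempfile

class MultiWriter:
    """Writes each value to <base>/<key>/<suffix>, one open file handle per key."""
    def __init__(self, base, suffix):
        self.base = base
        self.suffix = suffix
        self.writers = {}

    def write(self, key, value):
        # Lazily open one file per key, named after the partition suffix.
        if key not in self.writers:
            d = os.path.join(self.base, str(key))
            os.makedirs(d, exist_ok=True)
            self.writers[key] = open(os.path.join(d, self.suffix), "w")
        self.writers[key].write(str(value) + "\n")

    def close(self):
        for w in self.writers.values():
            w.close()

# Simulate one partition (index 0) holding all three example records.
base = tempfile.mkdtemp()
out = MultiWriter(base, "0")
for k, v in [(1, "a"), (1, "b"), (2, "c")]:
    out.write(k, v)
out.close()
```

Each key gets its own directory, and within it the partition index serves as the file name, so concurrent partitions never write to the same file.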