
Write to multiple outputs by key Spark - one Spark job


If you use Spark 1.4+, this has become much, much easier thanks to the DataFrame API. (DataFrames were introduced in Spark 1.3, but partitionBy(), which we need, was introduced in 1.4.)

If you're starting out with an RDD, you'll first need to convert it to a DataFrame:

val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, "charlie")))
val people_df = people_rdd.toDF("number", "name")

In Python, this same code is:

people_rdd = sc.parallelize([(1, "alice"), (1, "bob"), (2, "charlie")])
people_df = people_rdd.toDF(["number", "name"])

Once you have a DataFrame, writing to multiple outputs based on a particular key is simple. What's more -- and this is the beauty of the DataFrame API -- the code is pretty much the same across Python, Scala, Java and R:

people_df.write.partitionBy("number").text("people")

And you can easily use other output formats if you want:

people_df.write.partitionBy("number").json("people-json")
people_df.write.partitionBy("number").parquet("people-parquet")

In each of these examples, Spark will create a subdirectory for each distinct value of the key we've partitioned the DataFrame on:

people/
  _SUCCESS
  number=1/
    part-abcd
    part-efgh
  number=2/
    part-abcd
    part-efgh
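To make that layout concrete, here is a small plain-Python sketch (not Spark code) of the Hive-style `column=value` naming scheme that partitionBy uses. The single `part-00000` file name per directory is a simplifying assumption; Spark writes one part file per task.

```python
import os
import tempfile

def partition_path(base, column, value):
    """Build a Hive-style partition path like base/column=value."""
    return os.path.join(base, "{}={}".format(column, value))

# Simulate grouping rows into per-key directories the way partitionBy does.
rows = [(1, "alice"), (1, "bob"), (2, "charlie")]
base = tempfile.mkdtemp()
for number, name in rows:
    d = partition_path(base, "number", number)
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "part-00000"), "a") as f:
        f.write(name + "\n")

print(sorted(os.listdir(base)))  # ['number=1', 'number=2']
```

Readers of the partitioned output can recover the `number` column from the directory names alone, which is why the value itself does not need to be repeated inside the part files.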


I would do it like this, which is scalable:

import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
      .map(a => (k, v)) // Your own implementation
      .partitionBy(new HashPartitioner(num))
      .saveAsHadoopFile("output/path", classOf[String], classOf[String],
        classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}

I just saw a similar answer above, but we don't actually need custom partitioners. MultipleTextOutputFormat will create one file for each key, and it is fine for multiple records with the same key to fall into the same partition.

Use new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num high; that way each partition holds fewer distinct keys and does not open too many HDFS file handles at once.


If you potentially have many values for a given key, I think the scalable solution is to write out one file per key per partition. Unfortunately there is no built-in support for this in Spark, but we can whip something up.

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
  .mapPartitionsWithIndex { (p, it) =>
    val outputs = new MultiWriter(p.toString)
    for ((k, v) <- it) {
      outputs.write(k.toString, v)
    }
    outputs.close
    Nil.iterator
  }
  .foreach((x: Nothing) => ()) // To trigger the job.

// This one is local, but you could write one for HDFS.
class MultiWriter(suffix: String) {
  private val writers = collection.mutable.Map[String, java.io.PrintWriter]()

  def write(key: String, value: Any) = {
    if (!writers.contains(key)) {
      val f = new java.io.File("output/" + key + "/" + suffix)
      f.getParentFile.mkdirs
      writers(key) = new java.io.PrintWriter(f)
    }
    writers(key).println(value)
  }

  def close = writers.values.foreach(_.close)
}

(Replace PrintWriter with your choice of distributed filesystem operation.)

This makes a single pass over the RDD and performs no shuffle. It gives you one directory per key, containing one file per partition that held records for that key.
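For comparison, the same one-file-per-key-per-partition idea can be sketched in plain Python. This is a local-filesystem illustration of the MultiWriter pattern above (the class and directory names are just for the sketch), simulating how one partition of the example RDD would be written:

```python
import os
import tempfile

class MultiWriter:
    """Writes each value to <base>/<key>/<suffix>, one open file handle per key."""
    def __init__(self, base, suffix):
        self.base = base
        self.suffix = suffix
        self.writers = {}

    def write(self, key, value):
        # Lazily open one file per key, named after the partition suffix.
        if key not in self.writers:
            d = os.path.join(self.base, str(key))
            os.makedirs(d, exist_ok=True)
            self.writers[key] = open(os.path.join(d, self.suffix), "w")
        self.writers[key].write(str(value) + "\n")

    def close(self):
        for w in self.writers.values():
            w.close()

# Simulate one partition (index 0) holding all three example records.
base = tempfile.mkdtemp()
out = MultiWriter(base, "0")
for k, v in [(1, "a"), (1, "b"), (2, "c")]:
    out.write(k, v)
out.close()
```

Each key gets its own directory, and within it the partition index serves as the file name, so concurrent partitions never write to the same file.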