
Write and read raw byte arrays in Spark using SequenceFile


A common problem is getting a puzzling cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes does not do what you would expect: it returns the writable's internal buffer, which is usually over-allocated, so you get your bytes plus a pile of trailing zeros. You have to use copyBytes instead.
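
Here is a minimal sketch (plain Hadoop API, nothing Spark-specific) that shows the padding behaviour; setCapacity is used here only to simulate the buffer growth that happens when the same writable instance is reused during reads:

import org.apache.hadoop.io.BytesWritable

val bw = new BytesWritable("foo".getBytes)
bw.setCapacity(10)       // simulate the backing array growing, as happens when writables are reused
bw.getBytes.length       // 10 -- the raw backing array, padded with trailing zeros
bw.copyBytes().length    // 3  -- exactly the bytes that were stored
bw.getLength             // 3  -- the real payload length, if you insist on using getBytes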

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.rdd.RDD

val rdd: RDD[Array[Byte]] = ???

// To write
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", codecOpt)

// To read
val bytesRdd: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
  .map(_._2.copyBytes())
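
Here codecOpt stands for saveAsSequenceFile's optional compression codec argument, an Option[Class[_ <: CompressionCodec]]. A minimal sketch of passing one, assuming you want bzip2 compression:

import org.apache.hadoop.io.compress.BZip2Codec

// Same write as above, but with an explicit codec instead of a placeholder
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", Some(classOf[BZip2Codec]))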


Here is a snippet with all the required imports that you can run from spark-shell, as requested by @Choix:

import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.NullWritable

val path = "/tmp/path"

val rdd = sc.parallelize(List("foo"))
val bytesRdd = rdd.map { str =>
  (NullWritable.get, new BytesWritable(str.getBytes))
}
bytesRdd.saveAsSequenceFile(path)

val recovered = sc.sequenceFile[NullWritable, BytesWritable]("/tmp/path").map(_._2.copyBytes())
val recoveredAsString = recovered.map(new String(_))
recoveredAsString.collect()
// result is: Array[String] = Array(foo)