Write and read raw byte arrays in Spark using SequenceFile
Common problems seem to be a weird cannot-cast exception from BytesWritable to NullWritable. The other common problem is that BytesWritable.getBytes is misleadingly named and doesn't get just your bytes at all. What getBytes actually does is return the whole backing array: your bytes, then a pile of zero padding on the end! You have to use copyBytes instead (or getBytes together with getLength).
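
To make the padding concrete, here is a minimal sketch using only the Hadoop client library (no Spark needed); the setCapacity call is just my way of simulating the buffer growth Hadoop does internally when it reuses Writable objects:

    import org.apache.hadoop.io.BytesWritable

    val bw = new BytesWritable("foo".getBytes)
    bw.setCapacity(10)       // simulate internal buffer growth
    bw.getBytes.length       // 10 -- whole backing array, zero-padded past your data
    bw.getLength             // 3  -- number of valid bytes
    bw.copyBytes().length    // 3  -- properly sized copy of just the valid bytes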
    val rdd: RDD[Array[Byte]] = ???

    // To write
    // codecOpt is an Option[Class[_ <: CompressionCodec]], e.g. None
    rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
      .saveAsSequenceFile("/output/path", codecOpt)

    // To read (named differently here so it doesn't clash with the rdd above)
    val loaded: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
      .map(_._2.copyBytes())
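
For completeness, the codecOpt parameter of saveAsSequenceFile is an optional compression codec; a short sketch assuming you want gzip compression (the output path is illustrative):

    import org.apache.hadoop.io.compress.GzipCodec

    rdd.map(b => (NullWritable.get(), new BytesWritable(b)))
      .saveAsSequenceFile("/output/path", Some(classOf[GzipCodec]))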
Here is a snippet with all required imports that you can run from spark-shell, as requested by @Choix
    import org.apache.hadoop.io.BytesWritable
    import org.apache.hadoop.io.NullWritable

    val path = "/tmp/path"

    val rdd = sc.parallelize(List("foo"))
    val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
    bytesRdd.saveAsSequenceFile(path)

    val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
    val recoveredAsString = recovered.map(new String(_))
    recoveredAsString.collect()
    // result is: Array[String] = Array(foo)
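
As for the cannot-cast exception mentioned at the top: in my experience it usually means the key/value type parameters passed to sequenceFile are swapped relative to how the file was written. A contrast sketch (the swapped call, left commented out, is what to avoid):

    // Written as (NullWritable, BytesWritable), so this read matches:
    sc.sequenceFile[NullWritable, BytesWritable](path)

    // Swapped type parameters make the runtime try to cast NullWritable
    // to BytesWritable (and vice versa), hence the ClassCastException:
    // sc.sequenceFile[BytesWritable, NullWritable](path)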