
Why does Spark throw NotSerializableException org.apache.hadoop.io.NullWritable with sequence files


So it is possible to read non-serializable types into an RDD, i.e. to have an RDD of something that is not serializable (which seems counterintuitive). But as soon as you perform an operation on that RDD that requires the objects to be serializable, such as repartition, they must be serializable. Moreover, it turns out that those odd SomethingWritable classes, although invented for the sole purpose of serializing things, are not actually java.io.Serializable :(. So you must map them to byte arrays and back again:

import org.apache.hadoop.io.{BytesWritable, NullWritable}

sc.sequenceFile[NullWritable, BytesWritable](in)
  .map(_._2.copyBytes())                                // extract a plain Array[Byte], which is serializable
  .repartition(1000)                                    // the shuffle now only moves byte arrays
  .map(a => (NullWritable.get(), new BytesWritable(a))) // wrap back into Writables for output
  .saveAsSequenceFile(out, None)
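
For context, here is a minimal sketch of the failing version (the paths are hypothetical): repartition shuffles the records, which forces Spark to serialize each one with Java serialization, and neither NullWritable nor BytesWritable implements java.io.Serializable.

import org.apache.hadoop.io.{BytesWritable, NullWritable}

// Reading alone is fine: RDDs are lazy and nothing is serialized yet.
val rdd = sc.sequenceFile[NullWritable, BytesWritable]("hdfs:///in")

// The shuffle behind repartition must serialize every record, so this
// throws java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.
rdd.repartition(1000).saveAsSequenceFile("hdfs:///out")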

Also see: https://stackoverflow.com/a/22594142/1586965


In Spark, if you try to use a third-party class that is not serializable, it throws a NotSerializableException. This is because of Spark's closure handling: any instance variable defined outside a transformation that you access inside the transformation gets serialized by Spark, along with every class that object depends on. A minimal sketch of this follows below.
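
As an illustration (the Formatter class and names below are made up, not from any real library), capturing a non-serializable object from the driver triggers the exception; one common fix is to construct the object inside the task so nothing non-serializable has to be shipped to the executors:

import org.apache.spark.{SparkConf, SparkContext}

// A third-party-style class that does NOT implement java.io.Serializable
class Formatter { def format(x: Int): String = s"value=$x" }

object ClosureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo").setMaster("local[*]"))

    val fmt = new Formatter // lives on the driver

    // Fails: the closure captures `fmt`, so Spark tries to serialize it and
    // the job dies with a NotSerializableException for Formatter.
    // sc.parallelize(1 to 10).map(x => fmt.format(x)).collect()

    // Works: each task builds its own instance, so nothing non-serializable
    // crosses the driver/executor boundary.
    val ok = sc.parallelize(1 to 10).mapPartitions { it =>
      val localFmt = new Formatter
      it.map(localFmt.format)
    }.collect()

    ok.foreach(println)
    sc.stop()
  }
}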