
What is a data serialization system?


When Doug Cutting was writing Hadoop, he decided that the standard Java mechanism for serializing objects, Java Object Serialization (Java Serialization), didn't meet his requirements for Hadoop. The requirements were:

  1. Serialize the data into a compact binary format.
  2. Be fast, both in processing speed and in how quickly data can be transferred.
  3. Be interoperable, so that other languages can plug into Hadoop more easily.

As he described Java Serialization:

"It looked big and hairy and I thought we needed something lean and mean."

Instead of using Java Serialization, they wrote their own serialization framework. The main perceived problem with Java Serialization was that it writes the class name of each object being serialized to the stream; each subsequent instance of that class then carries a 5-byte reference to the first occurrence instead of the class name.

As well as reducing the effective bandwidth of the stream, this causes problems with random access and with sorting records in a serialized stream. Hadoop serialization therefore writes neither the class name nor the references, and assumes that the client knows the expected type.
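To make the trade-off concrete, here is a minimal Python sketch comparing a stream that carries a type name with every record against one that writes only the fields. It is a toy format, not Java's real stream layout (which writes the class description once and short references afterwards) and not Hadoop's actual encoding. The class name `com.example.Point` and all function names are made up for the illustration.

```python
import struct

RECORD = struct.Struct(">ii")  # two big-endian 32-bit ints, 8 bytes per record


def write_with_class_name(points, class_name="com.example.Point"):
    """Toy self-describing stream: every record carries its type name."""
    out = bytearray()
    name = class_name.encode("utf-8")
    for x, y in points:
        out += struct.pack(">H", len(name)) + name  # length-prefixed name
        out += RECORD.pack(x, y)
    return bytes(out)


def write_type_assumed(points):
    """Toy compact stream: fields only, the reader must already know the type."""
    return b"".join(RECORD.pack(x, y) for x, y in points)


def read_nth(stream, n):
    """Random access: fixed-size records make record n trivial to locate."""
    return RECORD.unpack_from(stream, n * RECORD.size)


points = [(1, 2), (3, 4), (5, 6)]
described = write_with_class_name(points)
compact = write_type_assumed(points)
print(len(described), len(compact))
print(read_nth(compact, 2))
```

Because the compact stream contains nothing but fixed-size records, jumping to record `n` is a single multiplication, which is roughly why dropping type information also helps with random access and sorting.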

Java Serialization also creates a new object for each one that is deserialized. Hadoop Writables, which implement Hadoop serialization, can be reused. This helps the performance of MapReduce, which at its core serializes and deserializes billions of records.
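The reuse idea can be sketched in a few lines of Python. This is only an illustration loosely modelled on the Writable pattern, not Hadoop's API: the `IntPair` class and its `write` and `read_fields` methods are hypothetical names chosen for this example.

```python
import struct


class IntPair:
    """Hypothetical mutable record, loosely modelled on the Writable idea."""

    _LAYOUT = struct.Struct(">ii")  # two big-endian 32-bit ints

    def __init__(self, first=0, second=0):
        self.first = first
        self.second = second

    def write(self):
        """Serialize the two fields, with no type information."""
        return self._LAYOUT.pack(self.first, self.second)

    def read_fields(self, buffer, offset=0):
        """Overwrite this instance's fields instead of building a new object."""
        self.first, self.second = self._LAYOUT.unpack_from(buffer, offset)
        return offset + self._LAYOUT.size


# Build a stream of 1000 serialized records.
stream = b"".join(IntPair(i, i * i).write() for i in range(1000))

record = IntPair()  # allocated once, reused for every record in the stream
offset = 0
total = 0
while offset < len(stream):
    offset = record.read_fields(stream, offset)
    total += record.second
print(total)
```

Only one `IntPair` is ever allocated on the reading side, however many records the stream holds. The catch is that a caller who wants to keep a record must copy it before the next read overwrites it.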

Avro fits into Hadoop by approaching serialization in a different manner: the client and server exchange a schema that describes the data stream. This helps make it fast and compact and, importantly, makes it easier to mix languages.

So Avro defines a serialization format, a protocol for clients and servers to communicate these serial streams, and a way to compactly persist data in files.
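To see why a shared schema keeps the data compact, here is a small Python sketch of schema-driven encoding. It is not Avro's wire format (Avro, for example, uses variable-length zig-zag integers) and it does not use the Avro library. The schema layout, `USER_SCHEMA`, `encode`, and `decode` are all assumptions made up for this illustration.

```python
import struct

# Hypothetical schema shared by writer and reader: ordered (name, type) pairs.
USER_SCHEMA = [("id", "int"), ("score", "double"), ("name", "string")]


def encode(record, schema):
    """Write field values in schema order; no names or type tags are emitted."""
    out = bytearray()
    for field, kind in schema:
        value = record[field]
        if kind == "int":
            out += struct.pack(">i", value)
        elif kind == "double":
            out += struct.pack(">d", value)
        elif kind == "string":
            raw = value.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length-prefixed bytes
        else:
            raise ValueError(f"unsupported type: {kind}")
    return bytes(out)


def decode(data, schema):
    """Use the same schema to know which bytes belong to which field."""
    record, offset = {}, 0
    for field, kind in schema:
        if kind == "int":
            (record[field],) = struct.unpack_from(">i", data, offset)
            offset += 4
        elif kind == "double":
            (record[field],) = struct.unpack_from(">d", data, offset)
            offset += 8
        elif kind == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            record[field] = data[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unsupported type: {kind}")
    return record


user = {"id": 7, "score": 0.5, "name": "ada"}
payload = encode(user, USER_SCHEMA)
restored = decode(payload, USER_SCHEMA)
print(len(payload), restored)
```

The payload holds only values, so field names like `score` never appear in it. Any language that can read the schema can decode the bytes, which is the interoperability point made above.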

I hope this helps. I thought a bit of Hadoop history would help explain why Avro is a subproject of Hadoop and what it's meant to help with.


If you need to store information such as a hierarchy or the implementation details of a data structure in a file of limited size, or pass that information over a network, you use data serialization. It is similar in spirit to the XML or JSON formats. The benefit is that information translated into a serialization format can be deserialized to regenerate whatever classes, objects, or data structures were serialized.

actual implementation --> serialization --> .xml, .json, or .avro --> deserialization --> implementation in original form
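The pipeline above can be tried in a few lines of Python, using JSON as the intermediate format. The nested structure is just an invented example.

```python
import json

# An in-memory structure with some hierarchy (hypothetical example data).
original = {
    "name": "root",
    "children": [{"name": "leaf", "values": [1, 2, 3]}],
}

text = json.dumps(original)   # serialization: structure -> JSON text
restored = json.loads(text)   # deserialization: JSON text -> structure

print(text)
print(restored == original)
```

The `text` string is what you would write to a file or send over the network; `restored` is a fresh copy rebuilt from it.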

Here is the link to the list of serialization formats. Comment if you want further information! :)