
Dealing with an incompatible version change of a serialization framework


I would use the multiple classloaders approach.

(Package renaming will also work. It does seem ugly, but this is a one-off hack, so beauty and correctness can take a back seat. Intermediate serialization seems risky: there was a reason you chose Kryo, and that reason will be negated by using a different intermediate form.)

The overall design would be:

    child classloaders:      Old Kryo     New Kryo   <-- both with simple wrappers
                                \       /
                                 \     /
                                  \   /
                                   \ /
                                    |
    default classloader:    domain model; controller for the re-serialization

  1. Load the domain object classes in the default classloader
  2. Load a Jar with the old Kryo version and wrapper code. The wrapper has a static 'main' method with one argument: the name of the file to deserialize. Call the main method via reflection from the default classloader:

        Class deserializer = deserializerClassLoader.loadClass("com.example.deserializer.Main");
        Method mainIn = deserializer.getMethod("main", String.class);
        Object graph = mainIn.invoke(null, "/path/to/input/file");

    1. This method:
      1. Deserializes the file as one object graph
      2. Hands the object graph back to the controller, either by returning it (as in the fragment above) or by placing it in a shared space such as a ThreadLocal
  3. When the call returns, load a second Jar with the new serialization framework and a simple wrapper. The wrapper has a static 'main' method that takes the object graph and the name of the file to serialize to. Call the main method via reflection from the default classloader:

        Class serializer = serializerClassLoader.loadClass("com.example.serializer.Main");
        Method mainOut = serializer.getMethod("main", Object.class, String.class);
        mainOut.invoke(null, graph, "/path/to/output/file");

    1. This method:
      1. Receives the object graph as an argument (or retrieves it from the ThreadLocal, if that is how it was shared)
      2. Serializes the object and writes it to the file
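The steps above can be sketched as a small controller that lives in the default classloader. This is only a sketch under assumptions: the jar file names are hypothetical, the wrapper class names are the ones used in the fragments above, and each wrapper jar is assumed to bundle exactly one Kryo version. `URLClassLoader` is used here because it delegates to its parent first, so the domain classes resolve to the single copy held by the controller's classloader.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;

class Reserializer {
    /** Creates one child classloader per wrapper jar; the parent supplies the domain classes. */
    static URLClassLoader childLoader(Path jar) throws Exception {
        URL[] urls = { jar.toUri().toURL() };
        return new URLClassLoader(urls, Reserializer.class.getClassLoader());
    }

    /** Looks up the static 'main' method of a wrapper class via reflection. */
    static Method findMain(ClassLoader loader, String className, Class<?>... parameterTypes)
            throws Exception {
        return loader.loadClass(className).getMethod("main", parameterTypes);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jar names: each jar holds one Kryo version plus its wrapper.
        try (URLClassLoader oldLoader = childLoader(Paths.get("old-kryo-wrapper.jar"));
             URLClassLoader newLoader = childLoader(Paths.get("new-kryo-wrapper.jar"))) {
            Method mainIn = findMain(oldLoader, "com.example.deserializer.Main", String.class);
            Method mainOut = findMain(newLoader, "com.example.serializer.Main",
                    Object.class, String.class);
            // Each command-line argument is treated as a file name stem.
            for (String file : args) {
                Object graph = mainIn.invoke(null, file + ".in");
                mainOut.invoke(null, graph, file + ".out");
            }
        }
    }
}
```

Both loaders are created once and closed when the run finishes, so the cost of loading the two Kryo versions is paid a single time regardless of how many files are converted.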

Considerations

In the code fragments, one classloader is created for each object serialization and deserialization. You probably want to create the classloaders only once, look up the main methods once, and then loop over the files, something like:

    for (String file : files) {
        Object graph = mainIn.invoke(null, file + ".in");
        mainOut.invoke(null, graph, file + ".out");
    }

Do the domain objects have any reference to any Kryo class? If so, you have difficulties:

  1. If the reference is just a class reference, e.g. to call a method, then the first use of the class will load one of the two Kryo versions into the default classloader. This will probably cause problems, as part of the serialization or deserialization might be performed by the wrong version of Kryo.
  2. If the reference is used to instantiate any Kryo objects and store the reference in the domain model (class or instance members), then Kryo will actually be serializing part of itself in the model. This may be a deal-breaker for this approach.

In either case, your first step should be to examine these references and eliminate them. One way to confirm that you have done this is to make sure the default classloader does not have access to any Kryo version. If the domain objects reference Kryo in any way, the reference will fail (with a NoClassDefFoundError if the class is referenced directly, or a ClassNotFoundException if reflection is used).
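A minimal sketch of such a check, run at startup from the controller, might look like the following. The class and method names are made up for illustration; `com.esotericsoftware.kryo.Kryo` is Kryo's main entry class, and the check assumes that seeing it from the default classloader means a Kryo jar has leaked onto the main classpath.

```java
class KryoIsolationCheck {
    /** Returns true if the named class can be found through the given classloader. */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader defaultLoader = KryoIsolationCheck.class.getClassLoader();
        if (isVisible("com.esotericsoftware.kryo.Kryo", defaultLoader)) {
            throw new IllegalStateException(
                    "Kryo is visible to the default classloader; remove it from the main classpath");
        }
        System.out.println("Default classloader cannot see Kryo");
    }
}
```

This only proves that Kryo is absent from the default classloader. References from the domain classes themselves still surface lazily, when the referencing code first runs.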


For 2, you can create two jar files that contain the serializer and all of its dependencies, one for the new and one for the old version of your serializer, as shown here. Then create a MapReduce job that loads each version of your code in a separate class loader, and add some glue code in the middle that deserializes with the old code and then serializes with the new code.

You will have to be careful that your domain object is loaded in the same class loader as your glue code, and that the class loaders holding the serialize/deserialize code delegate to the glue code's class loader, so that they all see the same domain object class.
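One way to verify this, sketched below, is to ask both class loaders for the domain class and compare the resulting `Class` objects by identity. The helper name is hypothetical, and `java.lang.String` stands in for your domain class; in a real run the two loaders would also be given the URLs of the old and new serializer jars.

```java
import java.net.URL;
import java.net.URLClassLoader;

class SharedDomainCheck {
    /** True if both loaders resolve the name to the very same Class object. */
    static boolean resolvesToSameClass(String className, ClassLoader first, ClassLoader second)
            throws ClassNotFoundException {
        return first.loadClass(className) == second.loadClass(className);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader glue = SharedDomainCheck.class.getClassLoader();
        // Empty URL arrays keep the sketch self-contained; real loaders would list the jars.
        try (URLClassLoader oldSide = new URLClassLoader(new URL[0], glue);
             URLClassLoader newSide = new URLClassLoader(new URL[0], glue)) {
            // Stand-in class name; replace with the fully qualified domain class.
            System.out.println(resolvesToSameClass("java.lang.String", oldSide, newSide));
        }
    }
}
```

If the check returns false, the domain class is probably packaged inside one of the serializer jars as well, and objects handed from one side to the other would fail with a ClassCastException.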


The easiest way I can think of offhand is to use an additional Java application that does the transformation for you. You send the binary data to the second Java application (simple local sockets would do the trick nicely), so you do not have to fiddle with classloaders or packages.

The only thing to think about is the intermediate representation. You might want to use another serialization mechanism or, if time is no issue, Java's built-in serialization.

Using a second Java application saves you from dealing with temporary storage, since everything can be done in memory.

And once you have the sockets and the second application's code in place, you will find plenty of situations where this comes in handy.
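As a rough sketch of the socket idea, the following uses Java's built-in serialization as the intermediate form over a loopback socket. It assumes the domain classes implement `Serializable`; the class and method names are invented for illustration. For brevity both ends run in one JVM here, whereas in the real setup `send` would live in the application with the old serializer and `receiveOne` in the application with the new one.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;

class SocketBridge {
    /** Sender side: writes one object graph using Java's built-in serialization. */
    static void send(int port, Serializable graph) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            out.writeObject(graph);
            out.flush();
        }
    }

    /** Receiver side: accepts a single connection and reads one object graph. */
    static Object receiveOne(ServerSocket server) throws IOException, ClassNotFoundException {
        try (Socket socket = server.accept();
             ObjectInputStream in = new ObjectInputStream(socket.getInputStream())) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Port 0 lets the operating system pick a free local port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            int port = server.getLocalPort();
            ArrayList<String> graph = new ArrayList<>(Arrays.asList("a", "b"));
            Thread sender = new Thread(() -> {
                try {
                    send(port, graph);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            sender.start();
            Object received = receiveOne(server);
            sender.join();
            System.out.println(received);
        }
    }
}
```

The receiving application would then pass the object it read to the new serializer. Note that Java serialization is noticeably slower and more verbose than Kryo, which is the trade-off mentioned above.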

You can also build a local cluster using JGroups and save yourself the hassle with sockets altogether. JGroups is the simplest communication API I know of: just form a logical channel and check who joins. Best of all, it even works within the same JVM, which makes testing easy, and when used remotely it can bind different physical servers together in the same way it does for local applications.

Another viable alternative is ZeroMQ with its ipc (inter-process communication) protocol.