Fast JSON serialization (and comparison with Pickle) for cluster computing in Python?

marshal is fastest, but pickle per se is not -- maybe you mean cPickle (which is pretty fast, especially with protocol -1). So, setting readability issues aside, here's some code to compare the various possibilities:

import pickle
import cPickle
import marshal
import json

def maked(N=5400):
  d = {}
  for x in range(N):
    k = 'key%d' % x
    v = [x] * 5
    d[k] = v
  return d

d = maked()

def marsh():
  return marshal.dumps(d)

def pick():
  return pickle.dumps(d)

def pick1():
  return pickle.dumps(d, -1)

def cpick():
  return cPickle.dumps(d)

def cpick1():
  return cPickle.dumps(d, -1)

def jso():
  return json.dumps(d)

def rep():
  return repr(d)

and here are their speeds on my laptop:

$ py26 -mtimeit -s'import pik' 'pik.marsh()'
1000 loops, best of 3: 1.56 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick()'
10 loops, best of 3: 173 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick1()'
10 loops, best of 3: 241 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick()'
10 loops, best of 3: 21.8 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick1()'
100 loops, best of 3: 10 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.jso()'
10 loops, best of 3: 138 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.rep()'
100 loops, best of 3: 13.1 msec per loop

So, with repr you keep readability and get about ten times the speed of json.dumps (you sacrifice the ease of parsing from JavaScript and other languages); with marshal you get the absolute maximum speed, almost 90 times faster than json. cPickle offers far more generality (in terms of what you can serialize) than either json or marshal, but if you're never going to need that generality you might as well go for marshal (or repr, if human readability trumps speed).
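
To make the repr route concrete, here is a minimal sketch of the round trip, assuming the data is built only from literals (dicts, lists, strings, numbers), as maked() above produces; repr() serializes, and ast.literal_eval() parses it back safely (unlike eval()):

import ast

d = {'key0': [0, 0, 0, 0, 0], 'key1': [1, 1, 1, 1, 1]}

text = repr(d)                      # human-readable serialization
restored = ast.literal_eval(text)   # safe parse back into Python objects
assert restored == d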

As for your "slicing" idea, instead of a multitude of files you might want to consider a database (a multitude of records) -- you might even get away without any serialization at all if your data has some recognizable "schema" to it.
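
Here is a hedged sketch of that database idea: if every value has the same shape (a key plus five integers, as maked() produces), rows in sqlite3 can replace explicit serialization entirely. The file name, table name, and column names below are illustrative, not from the original post.

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS items '
             '(k TEXT PRIMARY KEY, v0 INT, v1 INT, v2 INT, v3 INT, v4 INT)')

# toy stand-in for the full maked() dict
d = {'key0': [0, 0, 0, 0, 0], 'key1': [1, 1, 1, 1, 1]}

conn.executemany('INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?, ?, ?)',
                 [(k,) + tuple(v) for k, v in d.items()])
conn.commit()

# individual records come back without deserializing the whole data set
row = conn.execute('SELECT * FROM items WHERE k = ?', ('key1',)).fetchone()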


I think you are facing a trade-off here: human readability comes at the cost of performance and larger file sizes. Thus, of all the serialization methods available in Python, JSON is not only the most readable but also the slowest.

If I had to pursue performance (and file compactness), I'd go for marshal. You can either marshal the whole data set with dump() and load() or, building on your idea of slicing things up, marshal separate parts of the data set into separate files (see the sketch below). That way you open the door to parallelizing the data processing -- if you feel so inclined.
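
A minimal sketch of that slicing idea, assuming the maked() dict from the other answer; the file naming scheme and chunk size are made up for illustration. Each chunk goes to its own file with marshal.dump() and comes back with marshal.load(), so the files can be processed independently (e.g. by parallel workers).

import marshal

def maked(N=5400):
  d = {}
  for x in range(N):
    d['key%d' % x] = [x] * 5
  return d

d = maked()
keys = sorted(d)
chunk = 1000

# dump: one file per slice of keys
for i in range(0, len(keys), chunk):
  part = dict((k, d[k]) for k in keys[i:i + chunk])
  with open('part%03d.marshal' % (i // chunk), 'wb') as f:
    marshal.dump(part, f)

# load: each file can be read back on its own
with open('part000.marshal', 'rb') as f:
  part0 = marshal.load(f)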

Of course, there are all kinds of restrictions and warnings in the documentation, so if you decide to play it safe, go for pickle.
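
If you do go the safe route, a short sketch of the pickle version (protocol -1 selects the highest protocol available, which is also the fastest, as noted above); the file name is illustrative:

try:
  import cPickle as pickle   # Python 2
except ImportError:
  import pickle              # Python 3

d = {'key0': [0, 0, 0, 0, 0]}

with open('data.pkl', 'wb') as f:
  pickle.dump(d, f, -1)

with open('data.pkl', 'rb') as f:
  restored = pickle.load(f)
assert restored == d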