Opening a large JSON file
You want an incremental JSON parser such as yajl and one of its Python bindings. An incremental parser reads as little of the input as possible and invokes a callback whenever something meaningful is decoded. For example, to pull only the numbers out of a big JSON file:
```python
from yajl import YajlContentHandler, YajlParser

list_of_numbers = []

class ContentHandler(YajlContentHandler):
    def yajl_number(self, ctx, val):
        list_of_numbers.append(float(val))

parser = YajlParser(ContentHandler())
parser.parse(some_file)
```
See http://pykler.github.com/yajl-py/ for more info.
I found another Python wrapper around the yajl library: ijson. It works better for me than yajl-py for the following reasons:
- yajl-py did not detect the yajl library on my system; I had to hack the code to make it work
- the ijson code is more compact and easier to use
- ijson works with both yajl v1 and yajl v2, and it even has a pure-Python yajl replacement
- ijson has a very nice ObjectBuilder, which helps extract not just events but meaningful sub-objects from the parsed stream, at the level you specify
I found yajl (and hence ijson) to be much slower than the built-in json module when a large data file was read from local disk. Here is a module that claims to perform better than yajl/ijson (though still slower than json) when used with Cython:
http://pietrobattiston.it/jsaone
As the author points out, performance may be better than json when the file is received over the network, since an incremental parser can start parsing sooner.
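To illustrate why starting sooner helps, here is a stdlib-only sketch (not jsaone itself, just the general idea) that uses `json.JSONDecoder.raw_decode` to emit each complete top-level value as soon as enough bytes have arrived, instead of waiting for the whole stream:

```python
import json

def iter_json_values(chunks):
    """Yield complete top-level JSON values as soon as enough data
    has arrived. Works for concatenated JSON values, e.g. a
    newline-delimited stream coming in over a socket."""
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # value is still incomplete; wait for more data
            yield value
            buf = buf[end:]

# Simulate data arriving over the network in arbitrary pieces;
# {"a": 1} is yielded before the second chunk even exists:
chunks = ['{"a": 1}\n{"b"', ': 2}\n{"c": 3}']
print(list(iter_json_values(chunks)))  # [{'a': 1}, {'b': 2}, {'c': 3}]
```

A real incremental parser such as yajl goes further by resuming inside a single large value, but the latency advantage over a parse-everything-at-the-end approach is the same.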