How can I lazily read multiple JSON values from a file/stream in Python?
JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.
The object per line solution that you're using is seen elsewhere too. Scrapy calls it 'JSON lines':
- https://docs.scrapy.org/en/latest/topics/exporters.html?highlight=exporters#jsonitemexporter
- http://www.enricozini.org/2011/tips/python-stream-json/
You can do it slightly more Pythonically:
    for jsonline in f:
        yield json.loads(jsonline)  # or do the processing in this loop
I think this is about the best way - it doesn't rely on any third party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.
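For completeness, here is a minimal round-trip sketch of the one-object-per-line approach. The helper names `write_json_lines` and `read_json_lines` are made up for illustration, and an in-memory `io.StringIO` stands in for a real file:

```python
import io
import json


def write_json_lines(fp, objects):
    # With default settings (no indent), json.dumps escapes newlines inside
    # strings, so every record stays on a single line.
    for obj in objects:
        fp.write(json.dumps(obj) + "\n")


def read_json_lines(fp):
    # Iterating over a file yields one line at a time, so only one record
    # is parsed and held in memory at once.
    for line in fp:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


buf = io.StringIO()
write_json_lines(buf, [{"id": 1}, {"id": 2, "tags": ["a", "b"]}])
buf.seek(0)
records = list(read_json_lines(buf))
```

This only works if the writer never pretty-prints (for example with `indent=`), since that would spread one record over several lines.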
A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.
After not finding a generic-enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.
Example:
    import json
    import sys
    from splitstream import splitfile

    # This loop belongs inside a generator function, hence the yield.
    for jsonstr in splitfile(sys.stdin, format="json"):
        yield json.loads(jsonstr)
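To illustrate what a pre-tokenizer of this kind does, here is a rough pure-Python sketch. It is not how splitstream is implemented and `split_json_chunks` is a hypothetical name. It only tracks bracket depth and string state, and it assumes every top-level document is an object or an array (bare top-level scalars are skipped):

```python
import json


def split_json_chunks(chunks):
    """Yield the text of each top-level JSON object or array found in an
    iterable of string chunks. Hypothetical helper, not splitstream's API."""
    buf = []           # characters of the document currently being collected
    depth = 0          # current nesting depth of {} and []
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    for chunk in chunks:
        for ch in chunk:
            if depth == 0 and ch not in "{[":
                continue  # skip whitespace or separators between documents
            buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
                if depth == 0:
                    yield "".join(buf)
                    buf = []


# Documents may be split across chunk boundaries and may contain brackets
# or escaped quotes inside strings.
pieces = ['{"a": "}{"} [1, ', '2]\n{"b": {"c": "\\""}}']
docs = [json.loads(s) for s in split_json_chunks(pieces)]
```

The splitter never validates the JSON itself; malformed input is only detected when `json.loads` is called on a chunk.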
Sure you can do this. You just have to call raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.
    import json
    from json.decoder import WHITESPACE

    def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
        # Accept either a file-like object or a string.
        if hasattr(string_or_fp, "read"):
            string = string_or_fp.read()
        else:
            string = str(string_or_fp)

        decoder = cls(**kwargs)
        # raw_decode does not skip leading whitespace, so do it here.
        idx = WHITESPACE.match(string, 0).end()
        while idx < len(string):
            obj, end = decoder.raw_decode(string, idx)
            yield obj
            idx = WHITESPACE.match(string, end).end()
Usage: just as you requested, it's a generator.
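The answer mentions that the whole-file version can be adapted to read only as much as necessary. One possible way to do that is sketched below; the name `iterload_chunked` and the `chunk_size` parameter are made up for illustration. It assumes a text-mode file object and Python 3.5+ (for `json.JSONDecodeError`), and it simply retries `raw_decode` whenever the buffer does not yet hold a complete value:

```python
import io
import json

WS = " \t\n\r"  # the whitespace characters JSON allows between values


def iterload_chunked(fp, chunk_size=65536, cls=json.JSONDecoder, **kwargs):
    decoder = cls(**kwargs)
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip(WS)
        if buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # nothing more to read, so the data is malformed
            else:
                # A number at the very end of the buffer may continue in the
                # next chunk ("12" + "3", "1." + "5"), so accept a value only
                # when it is clearly finished.
                if (eof
                        or buf[end - 1] in '}]"'
                        or (end < len(buf) and buf[end] in WS)):
                    yield obj
                    buf = buf[end:]
                    continue
        elif eof:
            return
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


stream = io.StringIO('{"a": 1} [1, 2,\n 3]\n"x" 12345 1.5 {"b": null}')
values = list(iterload_chunked(stream, chunk_size=4))
```

The trade-off is that an incomplete value is re-parsed from its start each time a new chunk arrives, so a single very large document is slow with a small `chunk_size`, and malformed input is only reported once the end of the stream is reached.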