
How to parse bigdata json file (wikidata) in C++ efficiently?


I think the performance problem is not due to parsing. Using RapidJSON's SAX API should already give good performance and stay memory friendly. If you need to access every value in the JSON, this may already be the best solution.
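As a minimal sketch of that SAX approach (assuming the dump is one big JSON array of entity objects, as Wikidata dumps are; the file name "wikidata.json" is just a placeholder), a handler can stream the whole file in constant memory without ever building a DOM:

```cpp
#include <cstdio>
#include <rapidjson/reader.h>
#include <rapidjson/filereadstream.h>

// Minimal SAX handler: counts top-level entity objects without building a DOM.
// BaseReaderHandler supplies default callbacks that return true (keep parsing).
struct CountingHandler
    : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountingHandler> {
    int depth = 0;
    long entities = 0;

    bool StartObject() {
        if (++depth == 2) ++entities;  // objects directly inside the outer array
        return true;
    }
    bool EndObject(rapidjson::SizeType) { --depth; return true; }
    bool StartArray() { ++depth; return true; }
    bool EndArray(rapidjson::SizeType) { --depth; return true; }
};

int main() {
    FILE* fp = std::fopen("wikidata.json", "rb");   // placeholder file name
    if (!fp) return 1;

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));

    CountingHandler handler;
    rapidjson::Reader reader;
    reader.Parse(is, handler);   // streams the whole file, constant memory

    std::printf("entities: %ld\n", handler.entities);
    std::fclose(fp);
    return 0;
}
```

Real code would do something useful in the callbacks (filter by key, extract fields), but the memory profile stays the same.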

However, from the question description, it seems that reading all values at once is not your requirement. You want to read some (probably a small number of) values matching particular criteria (e.g., by primary keys). In that case, reading/parsing everything is not suitable.

You will need some indexing mechanism. Doing that with file positions may be possible: if the data at those positions is also valid JSON, you can seek there and stream it to RapidJSON to parse just that JSON value (RapidJSON can stop parsing as soon as one complete JSON value has been parsed, via kParseStopWhenDoneFlag), as sketched below.
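Here is a sketch of that seek-and-parse step. It assumes you have already built an index mapping a key to the byte offset where that entity's JSON object starts; the offset 123456789 and the file name are placeholders:

```cpp
#include <cstdio>
#include <rapidjson/document.h>
#include <rapidjson/filereadstream.h>

// Parse exactly one JSON value starting at a known byte offset.
// kParseStopWhenDoneFlag makes RapidJSON stop after the first complete
// value instead of reporting an error about trailing data.
rapidjson::Document parseValueAt(const char* path, long offset) {
    rapidjson::Document doc;

    FILE* fp = std::fopen(path, "rb");
    if (!fp) { doc.Parse("null"); return doc; }

    std::fseek(fp, offset, SEEK_SET);          // jump to the indexed position

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
    doc.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

    std::fclose(fp);
    return doc;
}

int main() {
    // 123456789 is a placeholder offset that would come from your own index.
    rapidjson::Document d = parseValueAt("wikidata.json", 123456789);
    if (!d.HasParseError() && d.IsObject() && d.HasMember("id"))
        std::printf("id: %s\n", d["id"].GetString());
    return 0;
}
```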

Another option is converting the JSON into some kind of database: an SQL database, a key-value store, or a custom format. With the indexing facilities they provide, you can query the data quickly. The conversion may take a long time, but it gives good performance for later retrieval.
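As one hedged example of the SQL route (SQLite chosen here only for illustration): store each entity's raw JSON text in a table keyed by its id. The two hard-coded entities stand in for whatever your streaming extraction of the dump would produce:

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>
#include <sqlite3.h>

// One-off conversion: store each entity's raw JSON text in SQLite, keyed by
// its id, so later lookups hit an index instead of rescanning the whole dump.
int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikidata.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
                 "CREATE TABLE IF NOT EXISTS entity("
                 "  id   TEXT PRIMARY KEY,"
                 "  body TEXT NOT NULL)",
                 nullptr, nullptr, nullptr);

    // In a real conversion these pairs would come from streaming the dump;
    // they are hard-coded here only to keep the sketch self-contained.
    std::vector<std::pair<std::string, std::string>> entities = {
        {"Q42", R"({"id":"Q42","labels":{"en":"Douglas Adams"}})"},
        {"Q64", R"({"id":"Q64","labels":{"en":"Berlin"}})"},
    };

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO entity(id, body) VALUES(?, ?)",
                       -1, &stmt, nullptr);

    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);  // batch inserts in one transaction
    for (const auto& e : entities) {
        sqlite3_bind_text(stmt, 1, e.first.c_str(),  -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, e.second.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```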

Note that JSON is an exchange format. It was not designed for fast individual queries on big data.


Update: I recently found a project, semi-index, that may suit your needs.


Write your own JSON parser, minimizing allocations and data movement. Also ditch multi-byte character handling for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, on that much data, would actually run out of memory. My parser would stop memory allocations once it reached a maximum nesting level.
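To illustrate the "fixed maximum nesting, no per-node allocation" idea, here is a rough sketch (not the answerer's parser, and not a full parser at all, just a structural scanner; the depth cap and file name are arbitrary placeholders):

```cpp
#include <cstdio>

// Structural scan of a JSON stream with a fixed nesting cap: no heap
// allocation at all, just a bounded depth counter. A real parser would also
// report token boundaries to user callbacks instead of only validating.
constexpr int kMaxDepth = 64;   // arbitrary cap for the sketch

bool scan(FILE* fp) {
    int depth = 0;
    bool inString = false, escaped = false;
    int c;
    while ((c = std::fgetc(fp)) != EOF) {
        if (inString) {                      // strings may contain { } [ ]
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  inString = false;
            continue;
        }
        switch (c) {
            case '"': inString = true; break;
            case '{': case '[':
                if (++depth > kMaxDepth) return false;   // refuse deeper nesting
                break;
            case '}': case ']':
                if (--depth < 0) return false;           // unbalanced document
                break;
            default: break;                              // numbers, literals, whitespace
        }
    }
    return depth == 0 && !inString;
}

int main() {
    FILE* fp = std::fopen("wikidata.json", "rb");   // placeholder file name
    if (!fp) return 1;
    std::puts(scan(fp) ? "well-formed structure" : "rejected");
    std::fclose(fp);
    return 0;
}
```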


Your definition of the problem does not allow a precise answer.

I wonder why you would want to stick to JSON in the first place. It is certainly not the best format for rapid access to big data.

If you're using your Wikidata data intensively, why not convert it into a more manageable format altogether?

It should be easy to automate a DB definition that matches the format of your entries, and convert the big lump of JSON into DB records once and for all.

You can stop the DB conversion at any point you like (i.e. store each JSON block as plain text, or refine it further).
In the minimal case, you'll end up with a DB table holding your records indexed by name and key, which you can then query directly (see the sketch below).
That is certainly less messy than using your file system as a database (by creating millions of files named after name+key) or writing dedicated code to seek to the records.
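A lookup in that minimal case could look like the following sketch (assuming SQLite and a hypothetical entity(id, body) table holding the raw JSON blocks; "Q42" is just a placeholder key):

```cpp
#include <cstdio>
#include <string>
#include <sqlite3.h>

// Look up one record by primary key; the DB index does the seeking for you.
// Assumes a hypothetical entity(id, body) table created during the conversion.
std::string lookup(sqlite3* db, const std::string& id) {
    sqlite3_stmt* stmt = nullptr;
    std::string body;
    if (sqlite3_prepare_v2(db, "SELECT body FROM entity WHERE id = ?",
                           -1, &stmt, nullptr) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, id.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            body = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return body;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikidata.db", &db) != SQLITE_OK) return 1;
    std::printf("%s\n", lookup(db, "Q42").c_str());  // placeholder key
    sqlite3_close(db);
    return 0;
}
```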

That will probably save you a lot of disk space too, since internal DB storage is usually more efficient than plain textual representation.