
How to parse bigdata json file (wikidata) in C++ efficiently?


I think the performance problem is not due to parsing. Using RapidJSON's SAX API should already give good performance and stay memory friendly. If you need to access every value in the JSON, this may already be the best solution.
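As a minimal sketch of that SAX approach (assuming the dump is one big JSON array of entity objects, as Wikidata dumps are; the file name "wikidata.json" is just a placeholder), a handler can stream the whole file in constant memory without ever building a DOM:

```cpp
#include <cstdio>
#include <rapidjson/reader.h>
#include <rapidjson/filereadstream.h>

// Minimal SAX handler: counts top-level entity objects without building a DOM.
// BaseReaderHandler supplies default callbacks that return true (keep parsing).
struct CountingHandler
    : rapidjson::BaseReaderHandler<rapidjson::UTF8<>, CountingHandler> {
    int depth = 0;
    long entities = 0;

    bool StartObject() {
        if (++depth == 2) ++entities;  // objects directly inside the outer array
        return true;
    }
    bool EndObject(rapidjson::SizeType) { --depth; return true; }
    bool StartArray() { ++depth; return true; }
    bool EndArray(rapidjson::SizeType) { --depth; return true; }
};

int main() {
    FILE* fp = std::fopen("wikidata.json", "rb");   // placeholder file name
    if (!fp) return 1;

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));

    CountingHandler handler;
    rapidjson::Reader reader;
    reader.Parse(is, handler);   // streams the whole file, constant memory

    std::printf("entities: %ld\n", handler.entities);
    std::fclose(fp);
    return 0;
}
```

Real code would do something useful in the callbacks (filter by key, extract fields), but the memory profile stays the same.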

However, from the question description, it seems that reading all values at once is not your requirement. You want to read some (probably a small number of) values matching particular criteria (e.g., by primary keys). In that case, reading/parsing everything is not suitable.

You will need some indexing mechanism. Doing that with file positions may be possible: if the data at those positions is also valid JSON, you can seek there and stream it to RapidJSON to parse just that JSON value (RapidJSON can stop parsing as soon as one complete JSON value has been parsed, via kParseStopWhenDoneFlag), as sketched below.
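Here is a sketch of that seek-and-parse step. It assumes you have already built an index mapping a key to the byte offset where that entity's JSON object starts; the offset 123456789 and the file name are placeholders:

```cpp
#include <cstdio>
#include <rapidjson/document.h>
#include <rapidjson/filereadstream.h>

// Parse exactly one JSON value starting at a known byte offset.
// kParseStopWhenDoneFlag makes RapidJSON stop after the first complete
// value instead of reporting an error about trailing data.
rapidjson::Document parseValueAt(const char* path, long offset) {
    rapidjson::Document doc;

    FILE* fp = std::fopen(path, "rb");
    if (!fp) { doc.Parse("null"); return doc; }

    std::fseek(fp, offset, SEEK_SET);          // jump to the indexed position

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
    doc.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

    std::fclose(fp);
    return doc;
}

int main() {
    // 123456789 is a placeholder offset that would come from your own index.
    rapidjson::Document d = parseValueAt("wikidata.json", 123456789);
    if (!d.HasParseError() && d.IsObject() && d.HasMember("id"))
        std::printf("id: %s\n", d["id"].GetString());
    return 0;
}
```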

Another option is converting the JSON into some kind of database: an SQL database, a key-value store, or a custom format. With the indexing facilities they provide, you can query the data quickly. The conversion may take a long time, but it gives good performance for later retrieval.
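As one hedged example of the SQL route (SQLite chosen here only for illustration): store each entity's raw JSON text in a table keyed by its id. The two hard-coded entities stand in for whatever your streaming extraction of the dump would produce:

```cpp
#include <cstdio>
#include <string>
#include <utility>
#include <vector>
#include <sqlite3.h>

// One-off conversion: store each entity's raw JSON text in SQLite, keyed by
// its id, so later lookups hit an index instead of rescanning the whole dump.
int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikidata.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
                 "CREATE TABLE IF NOT EXISTS entity("
                 "  id   TEXT PRIMARY KEY,"
                 "  body TEXT NOT NULL)",
                 nullptr, nullptr, nullptr);

    // In a real conversion these pairs would come from streaming the dump;
    // they are hard-coded here only to keep the sketch self-contained.
    std::vector<std::pair<std::string, std::string>> entities = {
        {"Q42", R"({"id":"Q42","labels":{"en":"Douglas Adams"}})"},
        {"Q64", R"({"id":"Q64","labels":{"en":"Berlin"}})"},
    };

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO entity(id, body) VALUES(?, ?)",
                       -1, &stmt, nullptr);

    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);  // batch inserts in one transaction
    for (const auto& e : entities) {
        sqlite3_bind_text(stmt, 1, e.first.c_str(),  -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, e.second.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```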

Note that JSON is an exchange format. It was not designed for fast individual queries on big data.


Update: I recently found a project, semi-index, that may suit your needs.


Write your own JSON parser, minimizing allocations and data movement. Also ditch multi-byte character handling for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, on that much data, would actually run out of memory. My parser would stop memory allocations once it reached a maximum nesting level.
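To illustrate the "fixed maximum nesting, no per-node allocation" idea, here is a rough sketch (not the answerer's parser, and not a full parser at all, just a structural scanner; the depth cap and file name are arbitrary placeholders):

```cpp
#include <cstdio>

// Structural scan of a JSON stream with a fixed nesting cap: no heap
// allocation at all, just a bounded depth counter. A real parser would also
// report token boundaries to user callbacks instead of only validating.
constexpr int kMaxDepth = 64;   // arbitrary cap for the sketch

bool scan(FILE* fp) {
    int depth = 0;
    bool inString = false, escaped = false;
    int c;
    while ((c = std::fgetc(fp)) != EOF) {
        if (inString) {                      // strings may contain { } [ ]
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  inString = false;
            continue;
        }
        switch (c) {
            case '"': inString = true; break;
            case '{': case '[':
                if (++depth > kMaxDepth) return false;   // refuse deeper nesting
                break;
            case '}': case ']':
                if (--depth < 0) return false;           // unbalanced document
                break;
            default: break;                              // numbers, literals, whitespace
        }
    }
    return depth == 0 && !inString;
}

int main() {
    FILE* fp = std::fopen("wikidata.json", "rb");   // placeholder file name
    if (!fp) return 1;
    std::puts(scan(fp) ? "well-formed structure" : "rejected");
    std::fclose(fp);
    return 0;
}
```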


Your definition of the problem does not allow a precise answer.

I wonder why you would want to stick to JSON in the first place. It is certainly not the best format for rapid access to big data.

If you're using your Wikidata data intensively, why not convert it into a more manageable format altogether?

It should be easy to automate a DB definition that matches the format of your entries, and convert the big lump of JSON into DB records once and for all.

You can stop the DB conversion at any point you like (i.e. store each JSON block as plain text, or refine it further).
In the minimal case, you'll end up with a DB table holding your records indexed by name and key, which you can then query directly (see the sketch below).
That is certainly less messy than using your file system as a database (by creating millions of files named after name+key) or writing dedicated code to seek to the records.
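A lookup in that minimal case could look like the following sketch (assuming SQLite and a hypothetical entity(id, body) table holding the raw JSON blocks; "Q42" is just a placeholder key):

```cpp
#include <cstdio>
#include <string>
#include <sqlite3.h>

// Look up one record by primary key; the DB index does the seeking for you.
// Assumes a hypothetical entity(id, body) table created during the conversion.
std::string lookup(sqlite3* db, const std::string& id) {
    sqlite3_stmt* stmt = nullptr;
    std::string body;
    if (sqlite3_prepare_v2(db, "SELECT body FROM entity WHERE id = ?",
                           -1, &stmt, nullptr) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, id.c_str(), -1, SQLITE_TRANSIENT);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            body = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return body;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("wikidata.db", &db) != SQLITE_OK) return 1;
    std::printf("%s\n", lookup(db, "Q42").c_str());  // placeholder key
    sqlite3_close(db);
    return 0;
}
```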

That will probably save you a lot of disk space too, since internal DB storage is usually more efficient than plain textual representation.