Stream based parsing and writing of JSON


Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, then emit it with a trailing/delimiting comma?

If you started with a file that contained only [, appended each JSON string with a trailing comma, and on the final entry skipped the comma and appended a closing ] instead, you'd have a JSON array of hashes and would only have to process one row's worth at a time.

It'd be a tiny bit slower (maybe) but wouldn't impact your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.
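
As a minimal sketch of that pattern, assuming the rows come from any enumerable source and writing to a hypothetical output.json file:

require 'json'

# Hypothetical sketch: rows could come from any enumerable source,
# not just a database query.
rows = [
  { 'name' => 'abc', 'price' => 1.23 },
  { 'name' => 'def', 'price' => 4.56 }
]

File.open('output.json', 'w') do |f|
  f.puts '['
  rows.each_with_index do |row, i|
    f.puts ',' if i > 0   # delimit the previous row, never the last one
    f.print JSON[row]
  end
  f.puts "\n]"
end

The Sequel-based examples below do the same thing, only with $stdout as the destination and a real dataset as the source.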

For instance, here's a combination of some Sequel example code and code that extracts the rows as JSON and builds a larger JSON structure:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)

add_comma = false
puts '['
items.order(:price).each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"

Which outputs:

[{"id":2,"name":"def","price":3.714714089426208},{"id":3,"name":"ghi","price":27.0179624376119},{"id":1,"name":"abc","price":52.51248221170203}]

Notice the order is now by "price".

Validation is easy:

require 'json'
require 'pp'

pp JSON[<<EOT]
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
EOT

Which results in:

[{"id"=>2, "name"=>"def", "price"=>3.714714089426208}, {"id"=>3, "name"=>"ghi", "price"=>27.0179624376119}, {"id"=>1, "name"=>"abc", "price"=>52.51248221170203}]

This validates the JSON and demonstrates that the original data is recoverable. Each row retrieved from the database should be a minimal, "bite-sized" piece of the overall JSON structure you want to build.

Building upon that, here's how to read incoming JSON in the database, manipulate it, then emit it as a JSON file:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :json
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
items.insert(:json => JSON[:name => 'def', :price => rand * 100])
items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])

add_comma = false
puts '['
items.each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[
    JSON[
      item[:json]
    ].merge('foo' => 'bar', 'time' => Time.now.to_f)
  ]
end
puts "\n]"

Which generates:

[{"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606},{"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664},{"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702},{"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732},{"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758},{"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803},{"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831},{"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862},{"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894}]

I added the timestamp so you can see that each row is processed individually, AND to give you an idea how fast the rows are being processed. Granted, this is a tiny, in-memory database with no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable host should be pretty fast too. Telling the ORM to read the DB in chunks can speed up the processing, because the DBM can return larger blocks and fill its packets more efficiently. You'll have to experiment to determine what chunk size you need, because it will vary with your network, your hosts, and the size of your records.
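
With Sequel, one way to do that chunked retrieval is Dataset#paged_each, which fetches rows in batches behind the scenes while still yielding one row at a time. Here's a sketch reusing the items dataset from the example above; the rows_per_fetch value is only a starting point you'd tune:

add_comma = false
puts '['
# paged_each requires an ordered dataset; it retrieves rows in batches
# (1000 here) but still yields them to the block one at a time.
items.order(:id).paged_each(:rows_per_fetch => 1000) do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"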

Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables look minuscule. VM slices are common these days, and we use them for crunching, so they're often the PCs of yesteryear: single CPUs with small memory footprints and dinky drives. We can't beat them up or they'll become bottlenecks, so we have to break the data into the smallest atomic pieces we can.

Harping about DB design: storing JSON in a database is a questionable practice. DBMs these days can spew JSON, YAML, and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML, or XML strings is a major hit in processing speed. Avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields, so your searches run at the highest possible speed. If the data is available in separate fields, then doing good ol' database queries, tweaking the results in the DBM or your scripting language of choice, and emitting the massaged data becomes a lot easier.
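
A sketch of that layout, using Sequel again; the column names and the sample record are illustrative only:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database, as in the examples above

DB.create_table :items do
  primary_key :id
  String :name
  Float :price
  String :json   # full serialized record, kept for output, never searched
  index :name
  index :price
end

record = { 'name' => 'abc', 'price' => 12.34 }
DB[:items].insert(
  :name  => record['name'],
  :price => record['price'],
  :json  => JSON[record]
)

# Queries hit the indexed columns; the JSON column is only read back out.
DB[:items].where { price < 50 }.each { |row| puts row[:json] }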


It is possible via the JSON::Stream or Yajl::FFI gems. You will have to write your own callbacks, though. Some hints on how to do that can be found here and here.
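
As a rough idea of what the callback style looks like with JSON::Stream (what you do inside the callbacks to assemble objects is up to you; the chunk boundaries below are arbitrary):

require 'json/stream'

# Sketch: JSON::Stream's event-driven parser fires callbacks as it sees
# tokens; assembling complete objects out of those events is left to you.
parser = JSON::Stream::Parser.new

parser.start_object { puts 'object start' }
parser.end_object   { puts 'object end' }
parser.key          { |k| puts "key:   #{k}" }
parser.value        { |v| puts "value: #{v}" }

# Feed the document in arbitrary chunks (e.g. as it is read from a file
# or socket) instead of loading the whole thing at once.
parser << '[{"name":"abc","pri'
parser << 'ce":1.23},{"name":"def","price":4.56}]'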

Facing a similar problem, I created the json-streamer gem, which spares you the need to write your own callbacks. It yields each object to you one by one, removing it from memory afterwards. You could then pass these to another IO object as intended.
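
Usage looks roughly like the following sketch, based on the gem's README; exact method names and options may differ between versions, and the file name and chunk size are placeholders:

require 'json/streamer'

# Rough sketch; 'data.json' and the chunk size are placeholders.
file_stream = File.open('data.json', 'r')
streamer = Json::Streamer.parser(file_io: file_stream, chunk_size: 1024)

# Yields each object found at nesting level 1 (each element of the
# top-level array), dropping it from memory once the block returns.
streamer.get(nesting_level: 1) do |object|
  # hand the object to another IO, enqueue it, etc.
  puts object
end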


There is a library called oj that does exactly that. It can do parsing and generation. For example, for parsing you can use Oj::Doc:

Oj::Doc.open('[3,[2,1]]') do |doc|
    result = {}
    doc.each_leaf() do |d|
        result[d.where?] = d.fetch()
    end
    result
end
#=> {"/1"=>3, "/2/1"=>2, "/2/2"=>1}

You can even backtrack in the file using doc.move(path). It seems very flexible.
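
For instance, a small sketch of moving around the same tiny document (paths are 1-based, as in the output above):

Oj::Doc.open('[3,[2,1]]') do |doc|
    doc.move('/2')      # step into the nested array
    puts doc.where?     #=> "/2"
    puts doc.fetch('1') #=> 2 (relative path, i.e. "/2/1")
    doc.move('/1')
    puts doc.fetch      #=> 3
end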

For writing documents, you can use Oj::StreamWriter:

require 'oj'

doc = Oj::StreamWriter.new($stdout)

def write_item(doc, item)
  doc.push_object
  doc.push_key "type"
  doc.push_value "item"
  doc.push_key "value"
  doc.push_value item
  doc.pop
end

def write_array(doc, array)
  doc.push_object
  doc.push_key "type"
  doc.push_value "array"
  doc.push_key "value"
  doc.push_array
  array.each do |item|
    write_item(doc, item)
  end
  doc.pop
  doc.pop
end

write_array(doc, [{a: 1}, {a: 2}])
#=> {"type":"array","value":[{"type":"item","value":{":a":1}},{"type":"item","value":{":a":2}}]}