
How can I process huge JSON files as streams in Ruby, without consuming all memory?


Problem

json = Yajl::Parser.parse(file_stream)

When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.

Solution

Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.

The example given in the README is:

Or let's say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!

(Assume we're in an EventMachine::Connection instance)

def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end

Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.

obj = Yajl::Parser.parse(str_or_io)
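For instance, here is a minimal sketch of the block form (the file name is hypothetical): passing a block to Parser#parse yields each top-level object as it is parsed, so an input containing several concatenated JSON documents never has to be materialized all at once:

require 'yajl'

# Hypothetical input: a file containing multiple concatenated JSON documents.
File.open('many_docs.json') do |io|
  Yajl::Parser.new.parse(io) do |obj|
    # each complete top-level JSON document is yielded here, one at a time
    puts obj.inspect
  end
end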

One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.

Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
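As a concrete starting point, here is a minimal sketch (the file name and chunk size are arbitrary) of driving the same streaming API outside EventMachine, feeding the parser fixed-size chunks from a plain File. Note that on_parse_complete still hands you each complete top-level object, so this helps most when the input is a series of documents rather than one giant one:

require 'yajl'

parser = Yajl::Parser.new(:symbolize_keys => true)
parser.on_parse_complete = lambda do |obj|
  # called once per complete top-level JSON object; handle and discard it here
  puts obj.inspect
end

File.open('huge.json') do |f|
  # feed the parser a chunk at a time instead of slurping the whole file
  parser << f.read(8192) until f.eof?
end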


Both @CodeGnome's and @A. Rager's answers helped me understand the solution.

I ended up creating the gem json-streamer, which offers a generic approach and spares you from manually defining callbacks for every scenario.
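For reference, usage looks roughly like this (per the gem's README at the time; the file name and option values shown are illustrative):

require 'json/streamer'

file = File.open('data.json')
streamer = Json::Streamer.parser(file_io: file, chunk_size: 500)

# yields every object found at the given nesting level, one by one
streamer.get(nesting_level: 1) do |object|
  puts object.inspect
end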


Your options seem to be json-stream and yajl-ffi. Both have a very similar example (they're from the same author):

def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            { |k| puts "key: #{k}" }
  @parser.value          { |v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

There, he sets up callbacks for all the data events that the stream parser can emit.

Given a JSON document that looks like:

{
  "1": {
    "name": "fred",
    "color": "red",
    "dead": true
  },
  "2": {
    "name": "tony",
    "color": "six",
    "dead": true
  },
  ...
  "n": {
    "name": "erik",
    "color": "black",
    "dead": false
  }
}
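To turn those low-level events into per-record processing for a document shaped like the one above, a minimal sketch (the depth-tracking scheme, handler name, file name, and chunk size are my own, not from the yajl-ffi README) might look like:

require 'yajl/ffi'

parser  = Yajl::FFI::Parser.new
depth   = 0
current = nil  # the record currently being assembled
key     = nil  # the pending key inside that record

parser.start_object do
  depth += 1
  current = {} if depth == 2  # a person record is opening
end

parser.end_object do
  if depth == 2 && current
    handle_record(current)  # hypothetical per-record handler
    current = nil
  end
  depth -= 1
end

parser.key   { |k| key = k if depth == 2 }
parser.value { |v| current[key] = v if depth == 2 }

File.open('huge.json') do |f|
  parser << f.read(4096) until f.eof?
end

Only one small record Hash lives in memory at any moment; the outer wrapper object is never built.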