How to stream creation of a JSON File?


I think there might be a couple of issues. First, I would suggest you do some profiling.

    // HUGE DATABASE DUMP HERE, needs to be converted to JSON, after getting all columns of all tables...
    echo 'Start Time: ' . date('Y-m-d H:i:s');
    // Double quotes so that \n is emitted as a real newline.
    echo ' Memory Usage: ' . (memory_get_usage() / 1048576) . " MB\n";

    $orders_query = $wpdb->get_results('
        SELECT ' . $select_data . '
        FROM ' . $wpdb->posts . ' AS p
        INNER JOIN ' . $wpdb->postmeta . ' AS pm ON (pm.post_id = p.ID)
        LEFT JOIN ' . $wpdb->prefix . 'woocommerce_order_items AS oi ON (oi.order_id = p.ID)
        LEFT JOIN ' . $wpdb->prefix . 'woocommerce_order_itemmeta AS oim ON (oim.order_item_id = oi.order_item_id)
        WHERE p.post_type = "shop_order"'
        . (!empty($exclude_post_statuses) ? ' AND p.post_status NOT IN ("' . implode('","', $exclude_post_statuses) . '")' : '')
        . (!empty($start_date) ? ' AND post_date >= "' . $start_date->format('Y-m-d H:i:s') . '"' : '')
        . (!empty($end_date) ? ' AND post_date <= "' . $end_date->format('Y-m-d H:i:s') . '"' : '') . '
        ORDER BY p.ID ASC', ARRAY_A);

    echo 'End Time: ' . date('Y-m-d H:i:s');
    echo ' Memory Usage: ' . (memory_get_usage() / 1048576) . " MB\n";
    die('Finished');

    $json = array();

The above will tell you how much memory is in use up to this point. If it fails before it prints 'Finished', we know it is not a JSON issue. If the script works fine, then we can first create a CSV file rather than JSON. Since you are running a select query, the result does not (at this point) have to be the nested JSON file you require. A flat structure can be achieved by just creating a CSV file.
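If you want more than two checkpoints, the profiling lines can be wrapped in a small helper. This is only a sketch: `memory_checkpoint` is a hypothetical name, and the `range()` call merely stands in for the real query result.

```php
<?php
// Hypothetical helper: returns one timestamped line with current and peak memory.
function memory_checkpoint(string $label): string
{
    $current = memory_get_usage() / 1048576;      // bytes -> MB
    $peak    = memory_get_peak_usage() / 1048576; // highest usage so far
    return sprintf(
        "%s | %s | current: %.2f MB | peak: %.2f MB\n",
        date('Y-m-d H:i:s'),
        $label,
        $current,
        $peak
    );
}

echo memory_checkpoint('before query');
$rows = range(1, 100000); // stand-in for the real result set
echo memory_checkpoint('after query');
```

Comparing the "current" and "peak" figures around the query shows whether fetching the rows alone is already close to the memory limit.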

    $csvFile = uniqid('orders') . '.csv';
    $fp = fopen($csvFile, 'w');

    if (!empty($orders_query)) {
        $firstRow = true;
        foreach ($orders_query as $order_query) {
            if (true === $firstRow) {
                // Write the column names once as the header row.
                $keys = array_keys($order_query);
                fputcsv($fp, $keys);
                $firstRow = false;
            }
            fputcsv($fp, $order_query);
        }
    }

    fclose($fp);

If the above works fine, you at least have a CSV file to work with.

At this point I am not sure how complex the nesting of your data structure is, e.g. how many distinct values exist for 'p_post_type' and 'p_post_name'. You might need to parse the CSV file and create a separate JSON file for each of ['p_post_type']['p_post_name']['posts'], ['p_post_type']['p_post_name']['postmeta'], ['p_post_type']['p_post_name']['woocommerce_order_items'] and ['p_post_type']['p_post_name']['woocommerce_order_itemmeta'].
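To gauge how many distinct values there are, the CSV can be parsed one row at a time instead of being loaded whole. A minimal sketch, assuming the file has a header row; `read_csv_rows` and `count_by_column` are hypothetical helper names:

```php
<?php
// Yields each data row of a CSV file as an associative array keyed by the
// header row. Only one row is held in memory at a time.
function read_csv_rows(string $path): Generator
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $header = fgetcsv($fp, 0, ',', '"', '\\');
        if ($header === false) {
            return; // empty file
        }
        while (($row = fgetcsv($fp, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // skip blank lines
            }
            yield array_combine($header, $row);
        }
    } finally {
        fclose($fp);
    }
}

// Counts rows per distinct value of one column without loading the whole file.
function count_by_column(string $path, string $column): array
{
    $counts = [];
    foreach (read_csv_rows($path) as $row) {
        $key = $row[$column];
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    return $counts;
}
```

Calling `count_by_column($csvFile, 'p_post_type')` would then show how many groups (and therefore how many JSON files) you would end up with.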

If there are only a few files, you can write a script to merge them automatically, or merge them manually. If you have too many nested items, the number of JSON files created might be large and hard to merge, so this might not be a feasible option.

If the number of JSON files is large, I would like to know the purpose of having such a huge single JSON file. If export is an issue, import might be an issue too, especially ingesting such a huge JSON file into memory. If the purpose of creating the JSON file is to import it in some form at some stage in the future, I think you might have to look at the option of just having a CSV file instead, which you can filter for whatever is required at that point in time.

I hope this helps.

FURTHER UPDATE

It looks to me as though $wpdb->get_results uses mysqli_query/mysql_query (depending on your configuration) to fetch the results; see the WordPress query docs. Fetching the data this way is not memory efficient. I believe you might be failing at this point ($wpdb->get_results) itself. I would suggest running the query without $wpdb. Whenever a large amount of data has to be retrieved, there is the concept of an unbuffered query, which has very low impact on memory. Further information can be found in the documentation on MySQL unbuffered queries.
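As a rough illustration of an unbuffered query outside $wpdb, here is a sketch using mysqli directly. `MYSQLI_USE_RESULT` is the mysqli flag for unbuffered results; the function name `stream_rows`, the credentials and the query in the usage comment are assumptions for the example only.

```php
<?php
/**
 * Streams the rows of a query one at a time using an unbuffered mysqli result.
 * $db is assumed to be an already-connected mysqli instance.
 */
function stream_rows(mysqli $db, string $sql): Generator
{
    // MYSQLI_USE_RESULT asks mysqli not to buffer the whole result set in PHP.
    $result = $db->query($sql, MYSQLI_USE_RESULT);
    if ($result === false) {
        throw new RuntimeException('Query failed: ' . $db->error);
    }
    try {
        while ($row = $result->fetch_assoc()) {
            yield $row;
        }
    } finally {
        // The result must be freed before another query can run on this connection.
        $result->free();
    }
}

// Usage (hypothetical credentials and query):
// $db = new mysqli('localhost', 'user', 'password', 'wordpress');
// foreach (stream_rows($db, 'SELECT ID, post_status FROM wp_posts') as $row) {
//     // handle one row at a time, e.g. write it straight to a file
// }
```

Keep in mind that while an unbuffered result is open, the same connection cannot be used for other queries, so any nested lookups would need a second connection.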

Even if you get past this point, you will still run into memory issues because you are storing everything in the $json variable, which eats up a lot of memory. $json is an array, and it is worth knowing how PHP arrays work. PHP arrays are dynamic, and they do not allocate extra memory every time a new element is added, since that would be extremely slow. Instead, the capacity grows in powers of two: whenever the current limit is exhausted, the array doubles its capacity and in the process tries to allocate twice the memory. This has been less of an issue since PHP 7, which made some major changes to the PHP core. So if 2 GB of data needs to be stored in $json, the script might easily allocate anywhere between 3 and 4 GB of memory, depending on when it hits the limit. Further details can be found in "php array" and "How does PHP memory actually work".

If you consider the overhead of $orders_query, which is an array, combined with the overhead of $json, it is quite substantial because of the way PHP arrays work.
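You can observe the growth steps yourself with a small experiment. This is a sketch only: `array_growth_points` is a hypothetical name, and the exact element counts and byte sizes it prints depend on the PHP version, so no particular output is claimed here.

```php
<?php
// Appends $count integers to an array and records the element counts at which
// PHP's reported memory usage jumped, i.e. where the array storage was regrown.
function array_growth_points(int $count): array
{
    $data = [];
    $points = [];
    for ($i = 0; $i < $count; $i++) {
        $before = memory_get_usage();
        $data[] = $i;
        $after = memory_get_usage();
        if ($after > $before) {
            $points[] = ['elements' => $i + 1, 'grew_by_bytes' => $after - $before];
        }
    }
    return $points;
}

foreach (array_growth_points(100000) as $p) {
    printf("at %6d elements: +%d bytes\n", $p['elements'], $p['grew_by_bytes']);
}
```

The point of the experiment is that memory does not rise with every append but in a small number of increasingly large steps.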

You can also try creating another database B: while you are reading from database A, you simultaneously write the data to database B. In the end you have database B with all the data in it, backed by the power of MySQL. You could also push the same data into MongoDB, which would be lightning fast and might help you with the JSON nesting you are after. MongoDB is meant to work really efficiently with large datasets.

JSON STREAMING SOLUTION

Firstly, I would like to say that streaming is a sequential/linear process. As such, it has no memory of what was added before this point or what will be added after it. It works in small chunks, and that is why it is so memory efficient. So when you actually write or read, the responsibility lies with the script to maintain a specific order. In effect you are writing/reading your own JSON, as streaming only understands text, has no clue what JSON is, and won't bother itself with writing/reading a correct one.
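To make the "you are writing your own JSON" point concrete, here is a minimal hand-rolled sketch that needs no library. The names `stream_json_array` and `fake_rows` are made up for the example, and the generator merely stands in for a database cursor.

```php
<?php
// Writes any iterable as a JSON array to an open stream, one item at a time.
// Only the current item is held in memory; brackets and commas are emitted by hand.
function stream_json_array($handle, iterable $items): int
{
    $written = 0;
    fwrite($handle, '[');
    foreach ($items as $item) {
        if ($written > 0) {
            fwrite($handle, ','); // separator goes before every item except the first
        }
        fwrite($handle, json_encode($item));
        $written++;
    }
    fwrite($handle, ']');
    return $written;
}

// Demo with a generator standing in for a database cursor.
function fake_rows(): Generator
{
    for ($id = 1; $id <= 3; $id++) {
        yield ['p_ID' => $id, 'p_post_type' => 'shop_order'];
    }
}

$fh = fopen('php://memory', 'w+');
stream_json_array($fh, fake_rows());
rewind($fh);
echo stream_get_contents($fh), "\n";
fclose($fh);
```

The stream itself never checks the structure. If the script forgets a comma or a closing bracket, the file is simply invalid JSON, which is why the order of writes matters so much.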

I have found a library on GitHub, https://github.com/skolodyazhnyy/json-stream, which would help you achieve what you want. I have experimented with the code and I can see it will work for you with some tweaks to your code.

I am going to write some pseudo-code for you.

    // Order is important in these queries, as streaming requires a consistent order.
    $query1 = select distinct p_post_type from ...YOUR QUERY... order by p_post_type;
    $result1 = based on $query1;

    $filename = 'data.json';
    $fh = fopen($filename, "w");
    $writer = new Writer($fh);
    $writer->enter(Writer::TYPE_OBJECT);

    foreach ($result1 as $fields1) {
        $posttype = $fields1['p_post_type'];
        // This node holds keyed children, so it is entered as an object.
        $writer->enter($posttype, Writer::TYPE_OBJECT);

        $query2 = select distinct p_post_name from ...YOUR QUERY... YOUR WHERE ... and p_post_type = $posttype order by p_post_type, p_post_name;
        $result2 = based on $query2;

        foreach ($result2 as $fields2) {
            $postname = $fields2['p_post_name'];
            $writer->enter($postname, Writer::TYPE_OBJECT);

            // Each array is entered once before its loop and left once after it.
            $query3 = select ..YOUR COLUMNS.. from ...YOUR QUERY... YOUR WHERE ... and p_post_type = $posttype and p_post_name = $postname and p_ID is not null order by p_ID;
            $result3 = based on $query3;
            $writer->enter('posts', Writer::TYPE_ARRAY);
            foreach ($result3 as $field3) {
                // write an array item
                $writer->write(null, $field3);
            }
            $writer->leave();

            $query4 = select ..YOUR COLUMNS.. from ...YOUR QUERY... YOUR WHERE ... and p_post_type = $posttype and p_post_name = $postname and pm_meta_id is not null order by pm_meta_id;
            $result4 = based on $query4;
            $writer->enter('postmeta', Writer::TYPE_ARRAY);
            foreach ($result4 as $field4) {
                // write an array item
                $writer->write(null, $field4);
            }
            $writer->leave();

            $query5 = select ..YOUR COLUMNS.. from ...YOUR QUERY... YOUR WHERE ... and p_post_type = $posttype and p_post_name = $postname and oi_order_item_id is not null order by oi_order_item_id;
            $result5 = based on $query5;
            $writer->enter('woocommerce_order_items', Writer::TYPE_ARRAY);
            foreach ($result5 as $field5) {
                // write an array item
                $writer->write(null, $field5);
            }
            $writer->leave();

            $query6 = select ..YOUR COLUMNS.. from ...YOUR QUERY... YOUR WHERE ... and p_post_type = $posttype and p_post_name = $postname and oim_meta_id is not null order by oim_meta_id;
            $result6 = based on $query6;
            $writer->enter('woocommerce_order_itemmeta', Writer::TYPE_ARRAY);
            foreach ($result6 as $field6) {
                // write an array item
                $writer->write(null, $field6);
            }
            $writer->leave();

            $writer->leave(); // leave $postname
        }

        $writer->leave(); // leave $posttype
    }

    $writer->leave(); // leave the root object
    fclose($fh);

You might want to start by limiting your queries to 10 rows or so until you get it right, since the code above might not work as it is. You should be able to read the file back in a similar fashion, as the same library has a Reader class to help. I have tested both the reader and the writer and they seem to work fine.


Creating the file

The problem with your code is that you are trying to fit the whole dataset into memory, which will eventually fail as soon as your database gets large enough. To overcome this you have to fetch the data in batches.

We are going to generate the query multiple times, so I extracted your query into a function. For brevity I skipped passing the required parameters (or making them global, if you will), so you have to get this to work by yourself.

    function generate_query($select, $limit = null, $offset = null) {
        $query = 'SELECT ' . $select . '
        FROM ' . $wpdb->posts . ' AS p
        INNER JOIN ' . $wpdb->postmeta . ' AS pm ON (pm.post_id = p.ID)
        LEFT JOIN ' . $wpdb->prefix . 'woocommerce_order_items AS oi ON (oi.order_id = p.ID)
        LEFT JOIN ' . $wpdb->prefix . 'woocommerce_order_itemmeta AS oim ON (oim.order_item_id = oi.order_item_id)
        WHERE p.post_type = "shop_order"'
        . (!empty($exclude_post_statuses) ? ' AND p.post_status NOT IN ("' . implode('","', $exclude_post_statuses) . '")' : '')
        . (!empty($start_date) ? ' AND post_date >= "' . $start_date->format('Y-m-d H:i:s') . '"' : '')
        . (!empty($end_date) ? ' AND post_date <= "' . $end_date->format('Y-m-d H:i:s') . '"' : '') . '
        ORDER BY p.ID ASC';

        // Compare against null rather than truthiness: the first batch has offset 0.
        if ($limit !== null && $offset !== null) {
            $query .= ' LIMIT ' . (int) $limit . ' OFFSET ' . (int) $offset;
        }

        return $query;
    }

Now we will get the results from the db in batches. We define the batch count, that is, the number of records we load into memory per iteration. You can later play with this value to find one that is fast enough and won't make PHP crash. Keep in mind that we want to keep the number of database queries as low as possible:

define('BATCH_COUNT', 500);

Before we create the loop we need to know how many iterations (database calls) we will make, so we need the total order count. Having this and the batch count, we can calculate this value easily:

    // get_var() returns the single COUNT(*) value; get_col() would return an array.
    $orders_count = (int) $wpdb->get_var(generate_query('COUNT(*)'));
    $iteration_count = (int) ceil($orders_count / BATCH_COUNT);
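As a quick sanity check of the arithmetic, the sketch below lists the OFFSET values a given row count produces. `batch_offsets` is a hypothetical helper and the numbers are made up for illustration.

```php
<?php
// Returns the OFFSET used by each batch for a given total row count.
function batch_offsets(int $total, int $batchSize): array
{
    $iterations = (int) ceil($total / $batchSize);
    $offsets = [];
    for ($i = 0; $i < $iterations; $i++) {
        $offsets[] = $i * $batchSize;
    }
    return $offsets;
}

// 1234 rows in batches of 500 need three queries.
print_r(batch_offsets(1234, 500)); // offsets 0, 500 and 1000
```

Note that the first offset is 0, so any code that decides whether to add LIMIT/OFFSET must not treat 0 as "no offset given".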

As a result we would like to have one huge JSON array inside the result file. Since each iteration produces a separate JSON string containing an array of objects, we strip the [ and ] from each side of that string, put a comma between consecutive batches and append the string to the file.

Final code:

    define('FILE', 'dump.json');

    file_put_contents(FILE, '[');
    $first_batch = true;

    for ($i = 0; $i < $iteration_count; $i++) {
        $offset = $i * BATCH_COUNT;
        $result = $wpdb->get_results(
            generate_query($select_data, BATCH_COUNT, $offset),
            ARRAY_A
        );

        // do additional work here, add missing arrays etc.
        // ...
        // I assume here the $result is a valid array ready for
        // creating JSON from it

        if (empty($result)) {
            continue;
        }

        // strip only the outermost [ and ] of this batch
        $partial = substr(json_encode($result), 1, -1);

        // we append the result file with partial JSON,
        // separating consecutive batches with a comma
        file_put_contents(FILE, ($first_batch ? '' : ',') . $partial, FILE_APPEND);
        $first_batch = false;
    }

    file_put_contents(FILE, ']', FILE_APPEND);

Congratulations, you have just created your first huge JSON dump ;) You should run this script from the command line so it can run as long as it needs to. There's no need to modify the memory limit from now on, because hopefully we are never going to hit it.

Sending the file

Streaming large files with PHP is easy and has already been answered on SO many times. However, I personally don't recommend doing anything time consuming in PHP, because it sucks as a long-running process, either on the command line or as a file server.
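For completeness, the plain-PHP approach usually boils down to copying the file in fixed-size chunks. A minimal sketch, where `stream_file` is a hypothetical helper and the 8192-byte chunk size is an arbitrary choice:

```php
<?php
// Copies a file to an output stream in fixed-size chunks so that memory use
// stays bounded regardless of file size. Returns the number of bytes sent.
function stream_file(string $path, $out, int $chunkSize = 8192): int
{
    $in = fopen($path, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $sent = 0;
    try {
        while (!feof($in)) {
            $chunk = fread($in, $chunkSize);
            if ($chunk === false || $chunk === '') {
                break;
            }
            $sent += (int) fwrite($out, $chunk);
        }
    } finally {
        fclose($in);
    }
    return $sent;
}

// In a web request the target would be php://output, after sending headers:
// header('Content-Type: application/json');
// stream_file('dump.json', fopen('php://output', 'wb'));
```

This keeps memory flat, but the PHP process stays busy for the whole transfer, which is exactly the drawback mentioned above.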

I assume you are using Apache. You should consider using X-Sendfile (mod_xsendfile) and let Apache do the hard work for you. This method is far more efficient when dealing with huge files, and it is very easy: all you need to do is pass the path to the file in a header:

header('X-Sendfile: ' . $path_to_the_file);

If you use Nginx, there is equivalent support as well (via the X-Accel-Redirect header).

This method does not use a lot of memory and does not block the PHP process. The file does not need to be accessible in the webroot either. I use XSendFile all the time to serve 4K videos to authenticated users.
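A small sketch of how the two servers differ from PHP's point of view. `offload_header` is a hypothetical helper, and the paths in the usage comment are placeholders. For Nginx the value is a URI that must map to a location marked `internal` in the Nginx configuration, not a filesystem path.

```php
<?php
// Builds the offload header line for the given server type (hypothetical helper).
// Apache with mod_xsendfile expects a filesystem path; Nginx expects a URI that
// maps to an `internal` location.
function offload_header(string $server, string $target): string
{
    switch ($server) {
        case 'apache':
            return 'X-Sendfile: ' . $target;
        case 'nginx':
            return 'X-Accel-Redirect: ' . $target;
        default:
            throw new InvalidArgumentException("Unsupported server: $server");
    }
}

// Usage inside a request handler, after authenticating the user
// (placeholder paths):
// header('Content-Type: application/json');
// header(offload_header('apache', '/var/data/dumps/dump.json'));
// header(offload_header('nginx', '/protected/dump.json'));
```

In both cases PHP only decides whether the user may download the file; the web server performs the actual transfer.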


First, you should ask yourself a question: do I need to write the database dump myself?

If not, then you can simply use a library that will do the work for you. Mysqldump-php should be able to do the job.

Then you can simply:

    include_once(dirname(__FILE__) . '/mysqldump-php-2.0.0/src/Ifsnop/Mysqldump/Mysqldump.php');

    $dump = new Ifsnop\Mysqldump\Mysqldump('mysql:host=localhost;dbname=testdb', 'username', 'password');
    $dump->start('storage/work/dump.sql');

This should create a .sql file. However, you wanted a JSON file. That shouldn't be a problem though. This tool will do the rest of the job: http://www.csvjson.com/sql2json

You can also find the source code of sql2json on GitHub: https://github.com/martindrapeau/csvjson-app