
Increasing memory limit for mongodb aggregate jobs


Well, no, there is no such setting, and if you really think about it there is good reason for this. If you first consider what aggregate is doing and what MongoDB does in general, it should become clear why.

This is what "should" be at the "head" of any sensible aggregation pipeline:

db.collection.aggregate([
    { "$match": { /* Something here */ } },

And these are the reasons:

  1. It makes good sense to try to reduce the working set that you are operating on in any operation.

  2. This is also the only time you get the opportunity to use an index to aid in the selection, which is always better than a collection scan (see the sketch after this list).

  3. Even though there is a built-in "optimizer" that looks for such things as "projections" limiting the "selected" fields, the best way to control working-set size is to only work on the valid records. Later-stage matches are not "optimized" in this way (see point 1).
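
A minimal sketch of what that "head of the pipeline" advice looks like in the shell; the collection, field, and index names here are placeholders for illustration only:

db.orders.createIndex({ "status": 1 })

// The leading $match can use the index above, so only matching
// documents ever enter the rest of the pipeline.
db.orders.aggregate([
    { "$match": { "status": "complete" } },
    { "$group": { "_id": "$customerId", "total": { "$sum": "$amount" } } }
])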

The next thing to consider is the general behavior of MongoDB. What the server process wants to do is "consume" as much of the available machine memory as it can in order to hold the "working set" data (collections and/or indexes) and "work" on that data in the most efficient way.

So it really is in the "best interests" of the database engine to "spend" most of its memory allocation in this way, since that way both your "aggregate" job and all of the other concurrent processes have access to the "working data" in the memory space.

It is therefore "not optimal" for MongoDB to "steal" this memory allocation away from the other concurrent operations just to service your running aggregation operation.

In the "programming to hardware requirements" terms, well you are aware that future releases allow the aggregation pipeline to implement "disk use" in order to allow larger processing. You can always implement SSD's or other fast storage technologies. And of course "10%" of RAM is subjective to the amount of RAM that is installed in a system. So you can always increase that.

The upshot of this is that MongoDB's actual job is being a "concurrent datastore", and it does that job well. What it is not is a dedicated "aggregation job-runner", and it should not be treated as such.

So either "break-up" your workloads, or increase your hardware spec, or simply switch the large "task running" activity to something that does focus on the running job such as a Hadoop-style "mapReduce", and leave MongoDB to it's job of serving the data.

Or, of course, change your design to simply "pre-aggregate" the required data somewhere "on write".
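
As a loose sketch of that "pre-aggregate on write" idea (the collections, fields, and values are made up for illustration): every time a raw record is written, a running summary document is bumped as well, so reads never need a heavy $group:

// Write the raw event as usual.
db.events.insert({ "customerId": 123, "amount": 9.99, "ts": new Date() })

// Maintain a per-customer daily total alongside it.
db.dailyTotals.update(
    { "customerId": 123, "day": "2015-06-01" },
    { "$inc": { "total": 9.99, "count": 1 } },
    { "upsert": true }
)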

As the saying goes, "Horses for courses", or use your tools for what they were designed for.


There are 3 solutions I can think of:

Solution 1 (MongoDB): $sort by id, and use $group only to count the number of documents per unique id. Then use $facet with $limit and $skip to split up the operation. I haven't done that yet, but I am about to try, since I am experiencing the same problem as you.
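
Something along these lines is what I have in mind; the collection and field names are assumptions, and each $facet sub-pipeline handles one chunk of the grouped output:

db.collection.aggregate(
    [
        { "$sort": { "id": 1 } },
        { "$group": { "_id": "$id", "count": { "$sum": 1 } } },
        { "$facet": {
            "page1": [ { "$skip": 0 }, { "$limit": 1000 } ],
            "page2": [ { "$skip": 1000 }, { "$limit": 1000 } ]
        } }
    ],
    { "allowDiskUse": true }
)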

Solution 2 (MongoDB): Depending on the queries of your use case and the number of distinct id's, you could create a collection for each id. It's ugly, but if each request only targets specific id's it might make sense. Especially with a low number of id's there might be no other workaround, since there is a hard limit on document size as well (16MB, I think), so a large enough document array nested inside a single document is not an option, I suppose.
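
A rough sketch of that per-id layout (the collection prefix and fields are hypothetical): route each write to its own collection so no later query ever has to group across id's:

// Each id gets its own collection, e.g. data_42, data_43, ...
var doc = { "id": 42, "payload": "scraped data here" };
db.getCollection("data_" + doc.id).insertOne(doc);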

Solution 3: This solution would be writing your own highly customized database solution. Allocate memory, ideally a large sequential block, and lay out all the data it can carry in struct arrays within that buffer, which can remain in memory at all times or be written back to a fast disk.

Most operations on that in-memory data complete in milliseconds.

If the data is too large to keep in memory for an entire operation, you can map files into memory and use views of those mapped files. I'm using 2 small NVMe drives in another project, which cost me 40€ each. Together they can write 15GB to disk in less than 3 seconds, which is the size of the buffer I need for the entire dataset.

I was convinced by a friend to give MongoDB a try in order to scrape wildly variable data, and am currently handling a 200MB collection which will eventually grow to 100 times that size. The operation I'm performing is simply grouping documents by id and pushing the roots into an array. allowDiskUse is set to true, but other than the wording of the error message this seems to make no difference.
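
For context, the operation that fails is roughly the following (the collection and field names are stand-ins for mine):

// Group all documents sharing an id and collect each whole
// document ($$ROOT) into an array per group.
db.scraped.aggregate(
    [
        { "$group": { "_id": "$id", "docs": { "$push": "$$ROOT" } } }
    ],
    { "allowDiskUse": true }
)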

I really don't see the issue with letting a user adjust the memory limit to their needs. Even if people will say using 16GB of memory is a bad idea, there are plenty of scenarios in which it definitely is not. As adults we should all be allowed to make our own mistakes. No operation on a 200MB database should ever be an issue, especially not one as simple as the one I am performing, or failing to perform.