Sub-query in MongoDB


No Joins in here, just us bears


So MongoDB "does not do joins". You might have tried something like this in the shell for example:

db.events.find().forEach(function(event) {
    event.user = db.users.findOne({ "_id": event.user });
    printjson(event);
})

But this does not do what you seem to think it does. It actually does exactly what it looks like: it runs a separate query on the "users" collection for every document returned from the "events" collection, with every round trip going "to and from" the client. None of it runs on the server.

For the same reason, your "embedded" statement within an aggregation pipeline does not work like that either. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So if you did something like this to select "UK" users:

db.events.aggregate([    { "$match": {        "user": {             "$in": db.users.distinct("_id",{ "country": "UK" })        }    }}])

Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. So the .distinct() runs first, returns its array as an argument, and only then is the whole pipeline sent to the server. That is the order of execution.
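In other words, the statement above is really just shorthand for running two separate operations from the shell, roughly like this sketch:

var ukUsers = db.users.distinct("_id", { "country": "UK" });   // evaluated on the client first

db.events.aggregate([                                           // only now is the whole pipeline sent to the server
    { "$match": { "user": { "$in": ukUsers } } }
])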

Correcting


You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:

  1. Embed your whole user object data within the event data (a sketch of this shape follows this list).

  2. At least embed "some" of the user object data within the event data. In this case "country" becasue you are going to use it.

So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:

{    "_id": ObjectId("yyyyy"),     "createdAt": ISODate(),     "user": {        "_id": ObjectId("xxxxx"),        "country": "UK"    }}

Then the "aggregation" process becomes simple:

db.events.aggregate([    { "$match": {        "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },        "user": { "$exists": true }    }},    { "$group": {        "_id": {            "year": { "$year": "$createdAt" },            "user_id": "$user._id"            "country": "$user.country"        },        "count": { "$sum": 1 }    }},    { "$group": {        "_id": "$_id.country",        "total": { "$sum": "$count" },        "distinct": { "$sum": 1 }    }}])

We're not normal


Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:

var bulk = db.events.initializeUnorderedBulkOp(),
    count = 0;

db.users.find().forEach(function(user) {
    // update multiple events for user
    bulk.find({ "user": user._id }).update({
        "$set": { "user": { "_id": user._id, "country": user.country } }
    });
    count++;

    // Send batch every 1000
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.events.initializeUnorderedBulkOp();
    }
});

// Clear any queued
if ( count % 1000 != 0 )
    bulk.execute();
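Once that has run, a quick check from the shell (just an illustrative query, not part of the update itself) shows the events now carrying the embedded form:

db.events.findOne({ "user.country": { "$exists": true } })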

So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.

If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with data sitting in separate collections, as for instance "users" would generally have more information attached to them than just an "_id" and a "country".

But the point here is that if you need "country" for analysis of "event" data by "user", then include it in that data as well. The most efficient server "join" is a "pre-join", and that is the general theory being applied here.