What is the correct way to do a HAVING in a MongoDB GROUP BY? [mongodb]

What is the correct way to do a HAVING in a MongoDB GROUP BY?


New answer using Mongo aggregation framework

After this question was asked and answered, 10gen released MongoDB version 2.2 with an aggregation framework. The best way to do this query is now:

db.col.aggregate([
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])

10gen has a handy SQL to Mongo Aggregation conversion chart worth bookmarking.
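For reference, here is the same pipeline again with comments indicating which SQL clause each stage plays the role of (a sketch; the collection and field names are the ones used above):

db.col.aggregate([
  // GROUP BY userId, name  (with COUNT(*) AS count)
  { $group: { _id: { userId: "$userId", name: "$name" }, count: { $sum: 1 } } },
  // HAVING COUNT(*) > 1
  { $match: { count: { $gt: 1 } } },
  // SELECT userId, name, count
  { $project: { _id: 0, userId: "$_id.userId", name: "$_id.name", count: 1 } }
])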


The answer already given is apt, to be honest, and the use of projection makes it even better thanks to the implicit optimisation working under the hood. I have made a small change, and I explain the benefit below.

The original command

db.getCollection('so').explain(1).aggregate([
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])

Parts from the explain plan

{    "stages" : [         {            "$cursor" : {                "queryPlanner" : {                    "plannerVersion" : 1,                    "namespace" : "5fa42c8b8778717d277f67c4_test.so",                    "indexFilterSet" : false,                    "parsedQuery" : {},                    "queryHash" : "F301762B",                    "planCacheKey" : "F301762B",                    "winningPlan" : {                        "stage" : "PROJECTION_SIMPLE",                        "transformBy" : {                            "name" : 1,                            "userId" : 1,                            "_id" : 0                        },                        "inputStage" : {                            "stage" : "COLLSCAN",                            "direction" : "forward"                        }                    },                    "rejectedPlans" : []                },                "executionStats" : {                    "executionSuccess" : true,                    "nReturned" : 6000,                    "executionTimeMillis" : 8,                    "totalKeysExamined" : 0,                    "totalDocsExamined" : 6000,
  • The sample set is pretty small, just 6000 documents
  • This query works on data in the WiredTiger internal cache, so if the collection is huge, all of it has to be pulled into the internal cache for the execution to take place. The WT cache is pretty important, and if this command takes up that much space in it, the cache will have to be sized larger to accommodate other operations (see the sketch after this list for a quick way to check cache usage)
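As a rough way to see how much of the WT cache is in use while such a query runs, the serverStatus output can be checked; a minimal sketch, assuming the standard wiredTiger.cache metric names (which may vary slightly by MongoDB version):

// Rough check of WiredTiger cache pressure via serverStatus
var cache = db.serverStatus().wiredTiger.cache;
print("cache bytes in use : " + cache["bytes currently in the cache"]);
print("cache max bytes    : " + cache["maximum bytes configured"]);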

Now for a small hack: the addition of an index.

 db.getCollection('so').createIndex({userId : 1, name : 1})                
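To confirm the compound index is in place before re-running the aggregation, the existing indexes can be listed:

// Lists all indexes on the collection; userId_1_name_1 should now appear
db.getCollection('so').getIndexes()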

New Command

db.getCollection('so').explain(1).aggregate([
  { $match: { name: { "$ne": null }, userId: { "$ne": null } } },
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])

Explain Plan

{  "stages": [{        "$cursor": {          "queryPlanner": {            "plannerVersion": 1,            "namespace": "5fa42c8b8778717d277f67c4_test.so",            "indexFilterSet": false,            "parsedQuery": {              "$and": [{                  "name": {                    "$not": {                      "$eq": null                    }                  }                },                {                  "userId": {                    "$not": {                      "$eq": null                    }                  }                }              ]            },            "queryHash": "4EF9C4D5",            "planCacheKey": "3898FC0A",            "winningPlan": {              "stage": "PROJECTION_COVERED",              "transformBy": {                "name": 1,                "userId": 1,                "_id": 0              },              "inputStage": {                "stage": "IXSCAN",                "keyPattern": {                  "userId": 1.0,                  "name": 1.0                },                "indexName": "userId_1_name_1",                "isMultiKey": false,                "multiKeyPaths": {                  "userId": [],                  "name": []                },                "isUnique": false,                "isSparse": false,                "isPartial": false,                "indexVersion": 2,                "direction": "forward",                "indexBounds": {                  "userId": [                    "[MinKey, undefined)",                    "(null, MaxKey]"                  ],                  "name": [                    "[MinKey, undefined)",                    "(null, MaxKey]"                  ]                }              }            },            "rejectedPlans": [{              "stage": "PROJECTION_SIMPLE",              "transformBy": {                "name": 1,                "userId": 1,                "_id": 0              },              "inputStage": {                "stage": "FETCH",                "filter": {                  "userId": {                    "$not": {                      "$eq": null                    }                  }                },                "inputStage": {                  "stage": "IXSCAN",                  "keyPattern": {                    "name": 1.0                  },                  "indexName": "name_1",                  "isMultiKey": false,                  "multiKeyPaths": {                    "name": []                  },                  "isUnique": false,                  "isSparse": false,                  "isPartial": false,                  "indexVersion": 2,                  "direction": "forward",                  "indexBounds": {                    "name": [                      "[MinKey, undefined)",                      "(null, MaxKey]"                    ]                  }                }              }            }]          },          "executionStats": {            "executionSuccess": true,            "nReturned": 6000,            "executionTimeMillis": 9,            "totalKeysExamined": 6000,            "totalDocsExamined": 0,            "executionStages": {              "stage": "PROJECTION_COVERED",              "nReturned": 6000,
  • Check the PROJECTION_COVERED stage: this command is a covered query, which means it relies entirely on data stored in the index
  • This command does not need to pull documents into the WT internal cache at all: totalDocsExamined is 0 because everything needed is read from the index. That is a big positive on a system where the WT cache is already under pressure from other operations
  • If the requirement is ever to search for specific names rather than scan the whole collection, this index becomes useful for that too :D
  • The disadvantage here is the extra index. If the index is also utilised by other operations there is no real downside, but if it exists only for this query it takes additional space in the cache, and writes are impacted marginally by the additional index (a quick way to check the index footprint is shown after this list)
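To gauge how much extra space the new index actually costs, the collection stats can be inspected; a small sketch using the collStats-based shell helper:

// Per-index sizes in bytes; useful to judge the cost of userId_1_name_1
db.getCollection('so').stats().indexSizes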

*On the performance front, for 6000 records the indexed version shows 1 ms more, but for a larger data set this may vary. It must be noted that the sample documents I inserted have just 3 fields: the two used here plus the default _id. If the collection has larger documents, the execution time of the original command will increase, and so will the volume it occupies in the cache.
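For context, the test documents were of that minimal shape; something like the following would reproduce a comparable data set (the values are made up purely for illustration):

// Hypothetical sample data: only the auto-generated _id plus userId and name, with deliberate duplicates
db.getCollection('so').insertMany([
  { userId: 1, name: "mark" },
  { userId: 1, name: "mark" },
  { userId: 2, name: "bill" }
])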