Fastest way to remove duplicate documents in mongodb



The dropDups: true option is no longer available in 3.0.

I have a solution using the aggregation framework: collect the duplicates, then remove them in one go.

It may be somewhat slower than a system-level index change, but it is good when you consider how you want to remove the duplicate documents.

a. Remove all documents in one go

var duplicates = [];

db.collectionName.aggregate([
  { $match: {
    name: { "$ne": '' }  // discard selection criteria
  }},
  { $group: {
    _id: { name: "$name" }, // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }    // duplicates have a count greater than one
  }}
],
{allowDiskUse: true}       // for faster processing if the set is larger
)                          // you can display the result up to here to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // first element skipped, i.e. kept
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId);   // collect all duplicate ids
    });
});

// If you want to check all "_id" values being deleted; otherwise this print is not needed
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({_id: {$in: duplicates}});
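The "keep the first, delete the rest" selection done by doc.dups.shift() can be sketched in plain JavaScript, independent of MongoDB, to show which _ids end up marked for deletion. The sample documents and the duplicateIds helper are invented for illustration; note that in the real pipeline $addToSet does not guarantee which _id of a group survives, whereas this sketch keeps the first one seen:

```javascript
// Given an array of documents, return the _ids of duplicates by `key`,
// keeping the first _id seen for each value (mirrors doc.dups.shift() above).
function duplicateIds(docs, key) {
  var seen = {};        // value -> true once the first occurrence is kept
  var dups = [];
  docs.forEach(function(doc) {
    var k = doc[key];
    if (seen[k]) {
      dups.push(doc._id);   // every later occurrence is marked for deletion
    } else {
      seen[k] = true;       // first occurrence survives
    }
  });
  return dups;
}

var docs = [
  {_id: 1, name: "a"},
  {_id: 2, name: "b"},
  {_id: 3, name: "a"},
  {_id: 4, name: "a"}
];
console.log(duplicateIds(docs, "name"));  // [3, 4]
```

The ids returned here are exactly what would be passed to remove({_id: {$in: duplicates}}).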

b. You can delete documents one by one.

db.collectionName.aggregate([
  // discard selection criteria; you can remove this "$match" stage if you want
  { $match: {
    "source_references.key": { "$ne": '' }   // dotted field names must be quoted
  }},
  { $group: {
    _id: { key: "$source_references.key" }, // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }    // duplicates have a count greater than one
  }}
],
{allowDiskUse: true}       // for faster processing if the set is larger
)                          // you can display the result up to here to check the duplicates
.forEach(function(doc) {
    doc.dups.shift();      // first element skipped, i.e. kept
    db.collectionName.remove({_id: {$in: doc.dups}});  // delete remaining duplicates
});


Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:

db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true}) 

As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it does not do exactly what you expect.

UPDATE

This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).


1. Create a collection dump with mongodump

2. Clear the collection

3. Add a unique index

4. Restore the collection with mongorestore
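The four steps above can be sketched as follows. The database and collection names (mydb, collectionName) and the indexed fields are assumptions; the mongodump/mongorestore commands run from the system shell, the rest in the mongo shell:

```javascript
// 1. Dump the collection first (system shell):
//      mongodump --db mydb --collection collectionName

// 2. Clear the collection (mongo shell):
db.collectionName.remove({});

// 3. Add the unique index:
db.collectionName.createIndex({name: 1, nodes: 1}, {unique: true});

// 4. Restore the dump (system shell); documents that violate the unique
//    index fail with duplicate key errors and are not re-inserted:
//      mongorestore --db mydb --collection collectionName dump/mydb/collectionName.bson
```

Because the unique index is in place before the restore, only the first copy of each duplicate group makes it back into the collection.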