Percentage of OR conditions matched in mongodb


Your solution really should be MongoDB-specific; otherwise you end up doing the calculations, and possibly the matching, on the client side, which is not good for performance.

So what you really want is a way to do that processing on the server side:

    db.products.aggregate([
        // Match the documents that meet your conditions
        { "$match": {
            "$or": [
                {
                    "features": {
                        "$elemMatch": {
                            "key": "Screen Format",
                            "value": "16:9"
                        }
                    }
                },
                {
                    "features": {
                        "$elemMatch": {
                            "key": "Weight in kg",
                            "value": { "$gt": "5", "$lt": "8" }
                        }
                    }
                }
            ]
        }},

        // Keep the document and a copy of the features array
        { "$project": {
            "_id": {
                "_id": "$_id",
                "product_id": "$product_id",
                "ean": "$ean",
                "brand": "$brand",
                "model": "$model",
                "features": "$features"
            },
            "features": 1
        }},

        // Unwind the array
        { "$unwind": "$features" },

        // Find the actual elements that match the conditions
        { "$match": {
            "$or": [
                {
                    "features.key": "Screen Format",
                    "features.value": "16:9"
                },
                {
                    "features.key": "Weight in kg",
                    "features.value": { "$gt": "5", "$lt": "8" }
                }
            ]
        }},

        // Count those matched elements
        { "$group": {
            "_id": "$_id",
            "count": { "$sum": 1 }
        }},

        // Restore the document and divide the matched elements by the
        // number of elements in the "or" condition
        { "$project": {
            "_id": "$_id._id",
            "product_id": "$_id.product_id",
            "ean": "$_id.ean",
            "brand": "$_id.brand",
            "model": "$_id.model",
            "features": "$_id.features",
            "matched": { "$divide": [ "$count", 2 ] }
        }},

        // Sort by the matched percentage
        { "$sort": { "matched": -1 } }
    ])

Since you know the "length" of the $or condition being applied, you simply need to find out how many of the elements in the "features" array match those conditions. That is what the second $match in the pipeline is for.

Once you have that count, you simply divide it by the number of conditions that were passed in your $or (2 in this example). The beauty here is that you can now do something useful with the result, such as sorting by that relevance and even "paging" the results server side.
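Since the divisor has to stay in step with the number of $or conditions, one option is to generate the pipeline from a list of conditions rather than writing it by hand. This is only a sketch: `buildMatchPipeline` and the `conditions` array shape are hypothetical names introduced for illustration, and the stages simply mirror the pipeline above.

```javascript
// Hypothetical helper (not part of MongoDB): builds the pipeline from a
// list of { key, value } conditions so the divisor is never hard-coded.
function buildMatchPipeline(conditions) {
  // Document-level filter: one $elemMatch per condition
  const docLevel = conditions.map(c => ({
    features: { $elemMatch: { key: c.key, value: c.value } }
  }));
  // Element-level filter, applied after $unwind
  const elemLevel = conditions.map(c => ({
    "features.key": c.key,
    "features.value": c.value
  }));

  return [
    { $match: { $or: docLevel } },
    { $project: {
        _id: {
          _id: "$_id", product_id: "$product_id", ean: "$ean",
          brand: "$brand", model: "$model", features: "$features"
        },
        features: 1
    }},
    { $unwind: "$features" },
    { $match: { $or: elemLevel } },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $project: {
        _id: "$_id._id",
        product_id: "$_id.product_id",
        ean: "$_id.ean",
        brand: "$_id.brand",
        model: "$_id.model",
        features: "$_id.features",
        // Divide by however many conditions were supplied
        matched: { $divide: ["$count", conditions.length] }
    }},
    { $sort: { matched: -1 } }
  ];
}

const conditions = [
  { key: "Screen Format", value: "16:9" },
  { key: "Weight in kg", value: { $gt: "5", $lt: "8" } }
];
const pipeline = buildMatchPipeline(conditions);
```

In the shell the result would then be passed along as `db.products.aggregate(pipeline)`.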

Of course if you want some additional "categorization" of this, all you would need to do is add another $project stage to the end of the pipeline:

    { "$project": {
        "product_id": 1,
        "ean": 1,
        "brand": 1,
        "model": 1,
        "features": 1,
        "matched": 1,
        "category": { "$cond": [
            { "$eq": [ "$matched", 1 ] },
            "100",
            { "$cond": [
                { "$gte": [ "$matched", .7 ] },
                "70-99",
                { "$cond": [
                    { "$gte": [ "$matched", .4 ] },
                    "40-69",
                    "under 40"
                ]}
            ]}
        ]}
    }}

Or something similar; the $cond operator can help you here.

The architecture should be fine as you have it: you can create a compound index on the "key" and "value" of the entries in your features array, and this should scale well for queries.

Of course, if you actually need something more than that, such as faceted searching and results, you can look at solutions like Solr or Elasticsearch. But a full implementation of that would be a bit lengthy for here.
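As a rough sketch of that index, assuming the collection is called `products` as in the pipeline above (`featureIndex` is just an illustrative variable name):

```javascript
// Compound index specification over the two fields inside each features entry.
// Because "features" is an array, MongoDB builds this as a multikey index.
const featureIndex = { "features.key": 1, "features.value": 1 };

// Only runs where a shell-style `db` handle exists (e.g. the mongo shell);
// elsewhere this line is skipped.
if (typeof db !== "undefined") {
  db.products.createIndex(featureIndex);
}
```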


I'm assuming that you'd like to compare the rest of the collection to a given product, which is a textbook example of aggregation:

    lookingat = db.products.findOne({ product_id: '50862224' })

    matches = db.products.aggregate([
        { $unwind: '$features' },
        { $match: { features: { $in: lookingat.features }}},
        { $group: { _id: '$product_id', matchedfeatures: { $sum: 1 }}},
        { $sort: { matchedfeatures: -1 }},
        { $limit: 5 },
        { $project: { _id: 0, product_id: '$_id',
                      pctmatch: { $multiply: [ '$matchedfeatures',
                                               100 / lookingat.features.length ]}
        }}
    ])

Walking through this briefly from the perspective of a product in the collection that has 6 features, and comparing it to the target product ('lookingat') which has 4 features, 3 of which match:

  1. $unwind turns 1 document with 6 features into 6 otherwise-identical documents with 1 feature each
  2. $match looks for that feature in the target's feature array (be aware that two documents are "equal" only if they have the same field names and values, in the same order), discards the 3 that don't match, and passes along the 3 that do
  3. $group consumes those 3 matching documents and produces a new one that tells you there were 3 documents that matched that product_id
  4. $sort and $limit give you the most relevant results and leave behind all those 1-feature matches you were concerned about
  5. $project lets you rename the _id from the $group step back to product_id and also convert the number of matching features into a percentage (we avoided a $divide operation by recognizing that 2 of the 3 terms in our calculation are constants and can be divided in JS)
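The walkthrough above can be traced in plain JavaScript without a database. Everything below is made-up sample data (the feature names, values, and the `A-1` product id are assumptions for illustration), and `sameDoc` only approximates the server's order-sensitive equality on embedded documents by comparing JSON strings.

```javascript
// Hypothetical target product with 4 features
const lookingat = {
  product_id: "50862224",
  features: [
    { key: "Screen Format", value: "16:9" },
    { key: "Weight in kg", value: "6" },
    { key: "Colour", value: "black" },
    { key: "HDMI ports", value: "3" }
  ]
};

// Hypothetical candidate with 6 features, 3 of which match the target
const candidate = {
  product_id: "A-1",
  features: [
    { key: "Screen Format", value: "16:9" },
    { key: "Weight in kg", value: "6" },
    { key: "Colour", value: "black" },
    { key: "HDMI ports", value: "2" },
    { key: "Resolution", value: "1080p" },
    { key: "Smart TV", value: "yes" }
  ]
};

// Order-sensitive comparison: same field names and values, in the same order
const sameDoc = (a, b) => JSON.stringify(a) === JSON.stringify(b);

// Step 1 ($unwind): one document per feature
const unwound = candidate.features.map(f => ({
  product_id: candidate.product_id,
  features: f
}));

// Step 2 ($match with $in): keep only features present in the target
const kept = unwound.filter(d =>
  lookingat.features.some(t => sameDoc(t, d.features))
);

// Step 3 ($group): count matches for this product_id
const matchedfeatures = kept.length;

// Step 5 ($project): multiply by the constant computed on the JS side
const pctmatch = matchedfeatures * (100 / lookingat.features.length);

// Field order matters: the same pair in a different order is not "equal"
const reordered = sameDoc(
  { value: "16:9", key: "Screen Format" },
  { key: "Screen Format", value: "16:9" }
);

console.log(unwound.length, matchedfeatures, pctmatch, reordered);
```

With this sample data, 3 of the target's 4 features match, so `pctmatch` comes out as 75, and the reordered feature is not treated as equal.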