Solution for finding "duplicate" records involving STI and parent-child relationship


The following SQL seems to do the trick:

    big_query = "
      SELECT EXISTS (
        SELECT 1
        FROM buyables b1
          JOIN buyables b2
            ON b1.shop_week_id = b2.shop_week_id
            AND b1.location_id = b2.location_id
        WHERE
          b1.parent_id != %1$d
          AND b2.parent_id = %1$d
          AND b1.type = 'Item'
          AND b2.type = 'Item'
        GROUP BY b1.parent_id
        HAVING COUNT(*) = ( SELECT COUNT(*) FROM buyables WHERE parent_id = %1$d AND type = 'Item' )
      )
    "

With ActiveRecord, you can get this result using select_value:

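    class Basket < Buyable
      # big_query is the SQL string shown above; it has to be in scope here
      # (for example as a constant or a private method).
      def has_duplicate
        result = self.class.connection.select_value(big_query % id)
        # Adapters return EXISTS as true/false, 1/0 or 't'/'f'. In Ruby, 0 and 'f'
        # are truthy, so compare explicitly instead of using !!.
        [true, 1, '1', 't'].include?(result)
      end
    end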

I am not so sure about performance, however.
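To make explicit what counts as a "duplicate" basket, here is a minimal plain-Ruby sketch that works on in-memory rows instead of the database. The `Row` struct, the helper names and the sample rows are made up for illustration; only the column names come from the query above. It treats two baskets as duplicates when their Items have exactly the same (shop_week_id, location_id) pairs, which can be handy as a reference when checking the SQL on a small data set.

```ruby
# Hypothetical in-memory stand-in for a row of the buyables table.
Row = Struct.new(:parent_id, :type, :shop_week_id, :location_id)

# Sorted [shop_week_id, location_id] pairs of every Item in one basket.
def basket_contents(rows, basket_id)
  rows.select { |r| r.parent_id == basket_id && r.type == 'Item' }
      .map { |r| [r.shop_week_id, r.location_id] }
      .sort
end

# True when some other basket holds exactly the same items.
# An empty basket is never reported as a duplicate (an assumption of this sketch).
def duplicate_basket?(rows, basket_id)
  mine = basket_contents(rows, basket_id)
  return false if mine.empty?

  other_ids = rows.map(&:parent_id).uniq - [basket_id, nil]
  other_ids.any? { |other| basket_contents(rows, other) == mine }
end

# Made-up sample data: basket 8 repeats basket 7, basket 9 is only a subset.
rows = [
  Row.new(7, 'Item', 1, 103), Row.new(7, 'Item', 1, 204),
  Row.new(8, 'Item', 1, 204), Row.new(8, 'Item', 1, 103),
  Row.new(9, 'Item', 1, 103)
]
puts duplicate_basket?(rows, 7) # basket 8 has the same items
puts duplicate_basket?(rows, 9) # no other basket has exactly one item
```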


If you want to make this as efficient as possible, you should consider creating a hash that encodes the basket contents as a single string or blob, adding a new column containing the hash (which will need to be updated every time the basket contents change, either by the app or using a trigger), and comparing hash values to determine possible equality. Then you might need to perform further comparisons (as described above) in order to confirm that baskets with matching hashes really are equal.

What should you use for a hash though? If you know that the baskets will be limited in size, and the ids in question are bounded integers, you should be able to hash to a string that is enough in itself to test for equality. For example, you could base64 encode each shop_week and location, concatenate them with a separator not in base64 (like "|"), and then concatenate the result with the other basket items. The items need to be put in a fixed order (sorted) first, so that equal baskets always produce the same string. Build an index on the new hash key, and comparisons will be fast.
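A minimal sketch of such a key in Ruby, under a few assumptions: `basket_signature` is a hypothetical helper, the items arrive as [shop_week_id, location_id] pairs of integers, ":" joins the two ids of one item and "|" joins the items. Neither character occurs in the base64 alphabet. `Array#pack` with the `'m0'` directive produces base64 without line breaks.

```ruby
# Hypothetical helper: builds a canonical string for a basket's contents.
# items is an array of [shop_week_id, location_id] pairs (assumed integers).
def basket_signature(items)
  items
    .sort # fixed order, so the order the items were added in does not matter
    .map { |pair| pair.map { |n| [n.to_s].pack('m0') }.join(':') }
    .join('|')
end

a = basket_signature([[1, 103], [1, 204]])
b = basket_signature([[1, 204], [1, 103]])
puts a == b # same items in a different order give the same signature
```

Because the string encodes the full contents, two baskets with equal signatures have equal contents. If the string is shortened with a digest to fit a fixed-size column, a follow-up comparison is needed to rule out collisions.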


To clarify my query, and my somewhat vague reading of the columns of the "buyable" table: the "Parent_ID" is the basket in question. The "Shop_Week_ID" determines which baskets may be compared: don't compare a basket from week 1 to one from week 2 or week 3. The ID column appears to be a sequential ID in the table, but not the actual ID of the item to be compared. The Location_ID appears to be the common "Item". In the scenario, assuming a shopping cart, Location_ID = 103 = "Computer", Location_ID = 204 = "Television" (just my interpretation of the data). If this is incorrect, minor adjustments may be needed, and it would help if the original poster showed a list of, say, a dozen rows of the data to show the proper correlation.

So, now, on to my query. I'm doing a STRAIGHT_JOIN (a MySQL hint) so the tables are joined in the order I've listed.

The first query for alias "MainBasket" is exclusively used to query how many items are in the basket in question ONCE, so it doesn't need to be re-joined/queried again for each possible basket to match. There is no "ON" clause as this will be a single record, and thus no Cartesian impact, as I want this COUNT(*) value applied to EVERY record in the final result.

The NEXT query finds each DISTINCT OTHER basket that has at LEAST ONE "Location_ID" (Item) in common within the same week as the parent in question. The baskets found this way could have fewer, the same number of, or more entries than the basket. But if there are 100 baskets and only 18 have at least 1 entry that matches 1 item in the original basket, you've just significantly cut down the number of baskets for the final compare (the SameWeekSimilar alias result).

Finally, there is a join to the buyable table again, driven by SameWeekSimilar, but only once per "other" basket that had a close match. No specific items, just the basket. The query used to get SameWeekSimilar has already pre-qualified the same week and at least one item matching the original basket in question, while specifically excluding the original basket so it isn't compared to itself.

By grouping at the outer level on SameWeekSimilar.NextBasket, we can get the count of actual items for that basket. Since MainBasket is a simple Cartesian join of a single row, we just grab the original count from it.

Finally, the HAVING clause. Since this is applied AFTER the "COUNT(*)", we know how many items were in the "Other" baskets, and how many in the "Main" basket. So, the HAVING clause is only including those where the counts were the same.

If you want to verify what I'm describing, run this against your table but DO NOT include the HAVING clause. You'll see all the POSSIBLE matches. Then re-add the HAVING clause and see which ones actually have the same count:

    select STRAIGHT_JOIN
          SameWeekSimilar.NextBasket,
          count(*) NextBasketCount,
          MainBasket.OrigCount
       from
          ( select count(*) OrigCount
               from Buyable B1
               where B1.Parent_ID = 7 ) MainBasket
          JOIN
          ( select DISTINCT
                  B2.Parent_ID as NextBasket
               from
                  Buyable B1
                     JOIN Buyable B2
                        ON B1.Parent_ID != B2.Parent_ID
                       AND B1.Shop_Week_ID = B2.Shop_Week_ID
                       AND B1.Location_ID = B2.Location_ID
               where
                  B1.Parent_ID = 7 ) SameWeekSimilar
          Join Buyable B1
             on SameWeekSimilar.NextBasket = B1.Parent_ID
       group by
          SameWeekSimilar.NextBasket
       having
          MainBasket.OrigCount = NextBasketCount
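The steps of the query can also be followed in plain Ruby on a small in-memory copy of the table. This is only a sketch: the method name `same_count_candidates`, the hash-based rows and the sample data are made up for illustration. It returns both the pre-qualified baskets with their counts (the result without the HAVING clause) and those whose count equals the original basket's (the result with it).

```ruby
# rows: array of hashes with :parent_id, :shop_week_id, :location_id
# (a hypothetical in-memory copy of the Buyable table).
def same_count_candidates(rows, basket_id)
  main = rows.select { |r| r[:parent_id] == basket_id }
  orig_count = main.size # MainBasket.OrigCount
  keys = main.map { |r| [r[:shop_week_id], r[:location_id]] }

  # SameWeekSimilar: other baskets sharing at least one (week, location) pair.
  shared = rows.select do |r|
    r[:parent_id] != basket_id && keys.include?([r[:shop_week_id], r[:location_id]])
  end
  similar = shared.map { |r| r[:parent_id] }.uniq

  # GROUP BY NextBasket: count every item of each candidate basket.
  counts = similar.map { |id| [id, rows.count { |r| r[:parent_id] == id }] }.to_h

  # HAVING: keep only the candidates whose count equals the original count.
  { possible: counts, same_count: counts.select { |_, n| n == orig_count }.keys }
end

# Made-up rows given as [parent_id, shop_week_id, location_id].
rows = [[7, 1, 103], [7, 1, 204], [8, 1, 103], [8, 1, 204], [9, 1, 103],
        [10, 1, 103], [10, 1, 999], [11, 2, 103], [11, 2, 204]]
       .map { |p, w, l| { parent_id: p, shop_week_id: w, location_id: l } }

p same_count_candidates(rows, 7)
```

In this made-up data, basket 11 is dropped by the week check, basket 9 is dropped by the count check, and baskets 8 and 10 both survive. Basket 10 shares only one item with basket 7, so an equal count narrows the list but does not by itself prove that two baskets are identical; a final item-by-item comparison of the remaining baskets may still be needed.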