How to select similar sets in SQL How to select similar sets in SQL sql sql

How to select similar sets in SQL


You specify

How can I write a query that can select all orders that are at least 85% similar to a specific order?

This is an important simplification compared with 'all pairs of ordersthat are at least 85% similar to each other'.

We'll use some TDQD (Test-Driven Query Design) and some analysis to help us.

Preliminaries

To be remotely similar, the two orders must have at least one item incommon. This query can be used to determine which orders have at leastone item in common with a specified order:

SELECT DISTINCT I1.OrderID AS ID  FROM OrderItem AS I1  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID>

This prunes the list of other orders to be examined quite a lot, thoughif the specified order included one of your most popular items, it'slikely that a lot of other orders also did so.

Instead of the DISTINCT, you could use:

SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common  FROM OrderItem AS I1  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID

This gives you the number of items in an order that it has in commonwith the specified order. We also need the number of items in eachorder:

SELECT OrderID AS ID, COUNT(*) AS Num_Total  FROM OrderItem GROUP BY OrderID;

Identical Orders

For 100% similarity, the two orders would have as many items in commonas each has items. This would probably not find many pairs of orders,though. We can find the orders with exactly the same items as thespecified order easily enough:

SELECT L1.ID  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total          FROM OrderItem         GROUP BY OrderID       ) AS L1  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common          FROM OrderItem AS I1          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>         WHERE I1.OrderID != <specified order ID>         GROUP BY I1.OrderID       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;

Edit: This turns out not to be stringent enough; for the orders to be identical, the number of items in the specified order must also be the same as the number in common:

SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total          FROM OrderItem         GROUP BY OrderID       ) AS L1  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common          FROM OrderItem AS I1          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>         WHERE I1.OrderID != <specified order ID>         GROUP BY I1.OrderID       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common  JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total          FROM OrderItem         WHERE OrderID = <specified order ID>         GROUP BY OrderID       ) AS L3 ON L2.Num_Common = L3.Num_Total;

Similar Orders — Analyzing the Formula

Applying the Jaccard Similarityas defined at Wikipedia to two orders A and B, with |A| being the countof the number of items in order A, the Jaccard Similarity J(A,B) =|A∩B| ÷ |A∪B|, where |A∩B| is the number of items in common tothe two orders and |A∪B| is the total number of different itemsordered.

To meet an 85% Jaccard Similarity criterion, if the number of items ineither order is less than some threshold, the orders must be identical.For example, if both orders A and B have 5 items, say, but there's oneitem different between the two, it gives you 4 items in common (|A∩B|)and 6 items in total (|A∪B|), so the Jaccard Similarity J(A,B) is only66⅔%.

For 85% similarity when there are N items in each of the two orders and1 item is different, (N-1) ÷ (N+1) ≥ 0.85, which means N > 12(12⅓ to be precise). For a fraction F = J(A,B), one item differentmeans (N-1) ÷ (N+1) ≥ F which can be solved for N giving N ≥ (1+ F) ÷ (1 - F). As the similarity requirement goes up, the ordersmust be identical for increasingly large values of N.

Generalizing still further, let's suppose we have different size orderswith N and M items (without loss of generality, N < M). The maximumvalue of |A∩B| is now N and the minimum value of |A∪B| is M (meaningall the items in the smaller order appear in the larger order). Let'sdefine that M = N + ∆, and that there are ∂ items present in thesmaller order that are not present in the larger order. It follows thatthere are ∆+∂ items present in the larger order that are not in thesmaller order.

By definition, then, |A∩B| = N-∂, and |A∪B| = (N-∂) + ∂ +(N+∆-(N-∂)), where the three added terms represent (1) the number ofitems in common between the two orders, (2) the number of items only inthe smaller order, and (3) the number of items only in the larger order.This simplifies to: |A∪B| = N+∆+∂.


Key Equation

For a similarity fraction F, we're interested in pairs of orders whereJ(A,B) ≥ F, so:

(N-∂) ÷ (N+∆+∂) ≥ F

F ≤ (N-∂) ÷ (N+∆+∂)


We can use a spreadsheet to graph the relationship between these. For agiven number of items in the smaller order (x-axis), and for a givensimilarity, we can graph the maximum value of ∂ that gives us asimilarity of F. The formula is:

∂ = (N(1-F) - F∆) ÷ (1+F)

...plot of ∂ = (N(1-F) - F∆) ÷ (1+F)...

This is a linear equation in N and ∆ for constant F; it is non-linearfor different values of F. Clearly, ∂ has to be a non-negativeinteger.

Given F = 0.85, for orders that are the same size (∆=0), for 1 ≤ N <13, ∂ = 0; for 13 ≤ N < 25, ∂ ≤ 1; for 25 ≤ N < 37, ∂ ≤ 2,for 37 ≤ N < 50, ∂ ≤ 3.

For orders that differ by 1 (∆=1), for 1 ≤ N < 18, ∂ = 0; for 18≤ N < 31, ∂ ≤ 1; for 31 ≤ N < 43, ∂ ≤ 2; etc. If ∆=6, youneed N=47 before the orders are still 85% similar with ∂=1. Thatmeans the small order has 47 items, of which 46 are in common with thelarge order of 53 items.

Similar Orders — Applying the Analysis

So far, so good. How can we apply that theory to selecting the orderssimilar to a specified order?

First, we observe that the specified order could be the same size as asimilar order, or larger, or smaller. This complicates things a bit.

The parameters of the equation above are:

  • N – number of items in smaller order
  • ∆ — difference between number of items in larger order and N
  • F — fixed
  • ∂ — number of items in smaller order not matched in larger order

The values available using minor variations on the queries developed at the top:

  • NC — number of items in common
  • NA — number of items in specified order
  • NB — number of items in compared order

Corresponding queries:

SELECT OrderID AS ID, COUNT(*) AS NA  FROM OrderItem WHERE OrderID = <specified order ID> GROUP BY OrderID;SELECT OrderID AS ID, COUNT(*) AS NB  FROM OrderItem WHERE OrderID != <specified order ID> GROUP BY OrderID;SELECT I1.OrderID AS ID, COUNT(*) AS NC  FROM OrderItem AS I1  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID> WHERE I1.OrderID != <specified order ID> GROUP BY I1.OrderID

For convenience, we want the values N and N+∆ (and hence ∆) available, sowe can use a UNION to arrange things appropriately, with:

  • NS = N — number of items in smaller order
  • NL = N + ∆ — number of items in larger order

and in the second version of the UNION query, with:

  • NC = N - ∂ — number of items in common

Both queries keep the two order ID numbers so that you can track back tothe rest of the order information later.

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL  FROM (SELECT OrderID AS ID, COUNT(*) AS NA          FROM OrderItem         WHERE OrderID = <specified order ID>         GROUP BY OrderID       ) AS v1  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB          FROM OrderItem         WHERE OrderID != <specified order ID>         GROUP BY OrderID       ) AS v2    ON v1.NA <= v2.NBUNIONSELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL  FROM (SELECT OrderID AS ID, COUNT(*) AS NA          FROM OrderItem         WHERE OrderID = <specified order ID>         GROUP BY OrderID       ) AS v1  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB          FROM OrderItem         WHERE OrderID != <specified order ID>         GROUP BY OrderID       ) AS v2    ON v1.NA > v2.NB

This gives us a table expression with columns OrderID_1, NS, OrderID_2,NL, where NS is the number of items in the 'smaller order and NL is thenumber of items in the larger order. Since there is no overlap in theorder numbers generated by the v1 and v2 table expressions, there's noneed to worry about 'reflexive' entries where the OrderID values are thesame. Adding NC to this is most easily handled in the UNION query too:

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC  FROM (SELECT OrderID AS ID, COUNT(*) AS NA          FROM OrderItem         WHERE OrderID = <specified order ID>         GROUP BY OrderID       ) AS v1  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB          FROM OrderItem         WHERE OrderID != <specified order ID>         GROUP BY OrderID       ) AS v2    ON v1.NA <= v2.NB  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC          FROM OrderItem AS I1          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>         WHERE I1.OrderID != <specified order ID>         GROUP BY I1.OrderID       ) AS v3    ON v3.ID = v2.IDUNIONSELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC  FROM (SELECT OrderID AS ID, COUNT(*) AS NA          FROM OrderItem         WHERE OrderID = <specified order ID>         GROUP BY OrderID       ) AS v1  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB          FROM OrderItem         WHERE OrderID != <specified order ID>         GROUP BY OrderID       ) AS v2    ON v1.NA > v2.NB  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC          FROM OrderItem AS I1          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>         WHERE I1.OrderID != <specified order ID>         GROUP BY I1.OrderID       ) AS v3    ON v3.ID = v1.ID

This gives us a table expression with columns OrderID_1, NS, OrderID_2,NL, NC, where NS is the number of items in the 'smaller order and NL isthe number of items in the larger order, and NC is the number of itemsin common.

Given NS, NL, NC, we are looking for orders that satisfy:

(N-∂) ÷ (N+∆+∂) ≥ F.

  • N – number of items in smaller order
  • ∆ — difference between number of items in larger order and N
  • F — fixed
  • ∂ — number of items in smaller order not matched in larger order

  • NS = N — number of items in smaller order

  • NL = N + ∆ — number of items in larger order
  • NC = N - ∂ — number of items in common

The condition, therefore, needs to be:

NC / (NL + (NS - NC)) ≥ F

The term on the LHS must be evaluated as a floating point number, not asan integer expression. Applying that to the UNION query above, yields:

SELECT OrderID_1, NS, OrderID_2, NL, NC,        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC          FROM (SELECT OrderID AS ID, COUNT(*) AS NA                  FROM OrderItem                 WHERE OrderID = <specified order ID>                 GROUP BY OrderID               ) AS v1          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB                  FROM OrderItem                 WHERE OrderID != <specified order ID>                 GROUP BY OrderID               ) AS v2            ON v1.NA <= v2.NB          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC                  FROM OrderItem AS I1                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>                 WHERE I1.OrderID != <specified order ID>                 GROUP BY I1.OrderID               ) AS v3            ON v3.ID = v2.ID        UNION        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC          FROM (SELECT OrderID AS ID, COUNT(*) AS NA                  FROM OrderItem                 WHERE OrderID = <specified order ID>                 GROUP BY OrderID               ) AS v1          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB                  FROM OrderItem                 WHERE OrderID != <specified order ID>                 GROUP BY OrderID               ) AS v2            ON v1.NA > v2.NB          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC                  FROM OrderItem AS I1                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>                 WHERE I1.OrderID != <specified order ID>                 GROUP BY I1.OrderID               ) AS v3            ON v3.ID = v1.ID       ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

You might observe that this query only uses the OrderItem table; theOrder and Item tables are not needed.


Warning: partially tested SQL (caveat lector). The SQL above now seems to produce plausible answers on minuscule data sets. I adjusted the similarity requirement (0.25, then 0.55) and got plausible values and appropriate selectivity. However, my test data had but 8 items in the biggest order, and certainly wasn't covering the full scope of the described data. Since the DBMS I use most frequently does not support CTEs, the SQL below is untested. However, I am moderately confident that unless I made a big mistake, the CTE code in version 1 (with lots of repetition of the specified order ID) should be clean. I think version 2 may be OK too, but...it is untested.

There may be more compact ways of expressing the query, possibly usingthe OLAP functions.

If I was going to test this, I'd create a table with a fewrepresentative sets of order items, checking that the similarity measurereturned was sensible. I'd work the queries more or less as shown,gradually building up the complex query. If one of the expressions wasshown to be flawed, then I'd make appropriate adjustments in that queryuntil the flaw was fixed.

Clearly, performance will be an issue. The innermost queries are notdreadfully complex, but they aren't wholy trivial. However, measurementwill show whether it's a dramatic problem or just a nuisance. Studyingthe query plans may help. It seems very probable that there should bean index on OrderItem.OrderID; the queries are unlikely to perform wellif there isn't such an index. That is unlikely to be a problem since itis a foreign key column.

You might get some benefit out of using 'WITH clauses' (Common Table Expressions). They would make explicit the repetition that is implicit in the two halves of the UNION sub-query.


Using Common Table Expressions

Using common table expressions clarifies to the optimizer whenexpressions are the same, and may help it perform better. They alsohelp the humans reading your query. The query above does rather beg forthe use of CTEs.

Version 1: Repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)              FROM OrderItem             WHERE OrderID = <specified order ID>             GROUP BY OrderID           ),     OO AS (SELECT OrderID AS ID, COUNT(*) AS NB       -- Other orders (OO)              FROM OrderItem             WHERE OrderID != <specified order ID>             GROUP BY OrderID           ),     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)              FROM OrderItem AS I1              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>             WHERE I1.OrderID != <specified order ID>             GROUP BY I1.OrderID           )SELECT OrderID_1, NS, OrderID_2, NL, NC,        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC          FROM SO AS v1          JOIN OO AS v2 ON v1.NA <= v2.NB          JOIN CI AS v3 ON v3.ID  = v2.ID        UNION        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC          FROM SO AS v1          JOIN OO AS v2 ON v1.NA  > v2.NB          JOIN CI AS v3 ON v3.ID  = v1.ID       ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Version 2: Avoiding repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)              FROM OrderItem             WHERE OrderID = <specified order ID>             GROUP BY OrderID           ),     OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB    -- Other orders (OO)              FROM OrderItem AS OI              JOIN SO ON OI.OrderID != SO.ID             GROUP BY OI.OrderID           ),     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)              FROM OrderItem AS I1              JOIN SO AS S1 ON I1.OrderID != S1.ID              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID              JOIN SO AS S2 ON I2.OrderID  = S2.ID             GROUP BY I1.OrderID           )SELECT OrderID_1, NS, OrderID_2, NL, NC,        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC          FROM SO AS v1          JOIN OO AS v2 ON v1.NA <= v2.NB          JOIN CI AS v3 ON v3.ID  = v2.ID        UNION        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC          FROM SO AS v1          JOIN OO AS v2 ON v1.NA  > v2.NB          JOIN CI AS v3 ON v3.ID  = v1.ID       ) AS u WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Neither of these is an easy read; both are easier than the big SELECT with the CTEs written out in full.


Minimal test data

This is inadequate for good testing. It gives a small modicum of confidence (and it did show up the problem with the 'identical order' query.

CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);CREATE TABLE Item  (ID SERIAL NOT NULL PRIMARY KEY);CREATE TABLE OrderItem(    OrderID INTEGER NOT NULL REFERENCES Order,    ItemID INTEGER NOT NULL REFERENCES Item,    Quantity DECIMAL(8,2) NOT NULL);INSERT INTO Order VALUES(1);INSERT INTO Order VALUES(2);INSERT INTO Order VALUES(3);INSERT INTO Order VALUES(4);INSERT INTO Order VALUES(5);INSERT INTO Order VALUES(6);INSERT INTO Order VALUES(7);INSERT INTO Item VALUES(111);INSERT INTO Item VALUES(222);INSERT INTO Item VALUES(333);INSERT INTO Item VALUES(444);INSERT INTO Item VALUES(555);INSERT INTO Item VALUES(666);INSERT INTO Item VALUES(777);INSERT INTO Item VALUES(888);INSERT INTO Item VALUES(999);INSERT INTO OrderItem VALUES(1, 111, 1);INSERT INTO OrderItem VALUES(1, 222, 1);INSERT INTO OrderItem VALUES(1, 333, 1);INSERT INTO OrderItem VALUES(1, 555, 1);INSERT INTO OrderItem VALUES(2, 111, 1);INSERT INTO OrderItem VALUES(2, 222, 1);INSERT INTO OrderItem VALUES(2, 333, 1);INSERT INTO OrderItem VALUES(2, 555, 1);INSERT INTO OrderItem VALUES(3, 111, 1);INSERT INTO OrderItem VALUES(3, 222, 1);INSERT INTO OrderItem VALUES(3, 333, 1);INSERT INTO OrderItem VALUES(3, 444, 1);INSERT INTO OrderItem VALUES(3, 555, 1);INSERT INTO OrderItem VALUES(3, 666, 1);INSERT INTO OrderItem VALUES(4, 111, 1);INSERT INTO OrderItem VALUES(4, 222, 1);INSERT INTO OrderItem VALUES(4, 333, 1);INSERT INTO OrderItem VALUES(4, 444, 1);INSERT INTO OrderItem VALUES(4, 555, 1);INSERT INTO OrderItem VALUES(4, 777, 1);INSERT INTO OrderItem VALUES(5, 111, 1);INSERT INTO OrderItem VALUES(5, 222, 1);INSERT INTO OrderItem VALUES(5, 333, 1);INSERT INTO OrderItem VALUES(5, 444, 1);INSERT INTO OrderItem VALUES(5, 555, 1);INSERT INTO OrderItem VALUES(5, 777, 1);INSERT INTO OrderItem VALUES(5, 999, 1);INSERT INTO OrderItem VALUES(6, 111, 1);INSERT INTO OrderItem VALUES(6, 222, 1);INSERT INTO OrderItem VALUES(6, 333, 1);INSERT INTO OrderItem VALUES(6, 444, 1);INSERT INTO OrderItem VALUES(6, 555, 1);INSERT INTO OrderItem VALUES(6, 777, 1);INSERT INTO OrderItem VALUES(6, 888, 1);INSERT INTO OrderItem VALUES(6, 999, 1);INSERT INTO OrderItem VALUES(7, 111, 1);INSERT INTO OrderItem VALUES(7, 222, 1);INSERT INTO OrderItem VALUES(7, 333, 1);INSERT INTO OrderItem VALUES(7, 444, 1);INSERT INTO OrderItem VALUES(7, 555, 1);INSERT INTO OrderItem VALUES(7, 777, 1);INSERT INTO OrderItem VALUES(7, 888, 1);INSERT INTO OrderItem VALUES(7, 999, 1);INSERT INTO OrderItem VALUES(7, 666, 1);


There's really no easy answer to this. You can certainly store the Jaccard index (actually I'd just store the ones that meet the criteria, and throw out the rest), but the real problem is calculating it (effectively have to scan all of your existing order each time a new order was entered in to the system to calculate the new index).

That can be quite expensive depending on your volume of orders that's you're maintaining. Maybe you only compare it to the last year of orders, or something.

If you're doing it on the fly, it gets more interesting, but still expensive.

You can readily get a list of all orders that have the same product items. One list per item. This, in fact, is not necessarily a lot of data (if you have a lot of orders for a single popular item, then it can be a long list). The individual queries aren't particularly insane either (again depending on your data). If you have a vast vast amount of data, the query can be readily map/reduced and even work with sharded data stores. Bitmap indexes (if your DB support this) are particularly good for getting lists like this quite quickly.

Then you can simply count the times that an order number occurs in all of the lists, and then drop those off that don't meet the threshold. That's a straight forward merge operation.

But you'd have to do this calculation every single time you'd want the information, since you can't really store it.

So, it really does boil down to what you need the information for, how often you need it, your items <-> order distribution, how long you can wait for it, etc.

Addenda:

Thinking about it a little more, this is a simple query, but it may take some time to run. Likely not much with modern hardware, you don't really have that much data. For a single screen viewing an order you wouldn't notice it. If you were running report across all orders, then you would definitely notice it -- and would need a different approach.

Lets consider an order with 20 line items.

And you want an 85% match. That means orders that have 17 or more items in common.

Here is a query that will give you the orders you're interested in:

SELECT orderId, count(*) FROM OrderItemWHERE itemId in ('list', 'of', 'items', 'in', 'order', 123, 456, 789)GROUP BY orderIdHAVING count(*) >= 17

So, this gives you a collection of all the line items with the same items as your order. Then you simply sum them up by orderId, and those that are equal to or greater than your threshold (17 in this case), are the candidate orders.

Now, you don't say how many items you have in your catalog. If you have 1000 items, perfectly distributed, this query will chew on 1600 rows of data -- which is no big deal. With proper indexes this should go quite quickly. However, if you have items that are "really popular", then you're going to chew through a lot more rows of data.

But, again, you don't have that much data. Most of this query can be done within the indexes on a proper database and not even hit the actual tables. So, as I said, you'll likely not notice the impact of this query on a interactive system.

So, give it a try and see how it goes for you.


This approach takes into account the Quantity using the Extended Jaccard Coefficient or Tanimoto Similarity. It computes the similarity across all Orders, by using a the vector of common ItemIDs of magnitude Quantity. It does cost a table scan, but doesn't require an N^2 computation of all possible similarities.

SELECT    OrderID,    SUM(v1.Quantity * v2.Quantity) /    (SUM(v1.Quantity * v1.Quantity) +     SUM(v2.Quantity * v2.Quantity) -     SUM(v1.Quantity * v2.Quantity) ) AS coefFROM    OrderItem v1 FULL OUTER JOIN OrderItem v2    ON v1.ItemID = v2.ItemID    AND v2.OrderID = ?GROUP BY OrderIDHAVING coef > 0.85;

Formula for the Extended Jaccard Coefficient:

Tanimoto Similarity