Procedurally transform subquery into join Procedurally transform subquery into join sql sql

Procedurally transform subquery into join


Converting a subquery into a JOIN can be pretty straightforward:

IN clause

 FROM TABLE_X xWHERE x.col IN (SELECT y.col FROM TABLE_Y y)

...can be converted to:

FROM TABLE_X xJOIN TABLE_Y y ON y.col = x.col

Your JOIN criteria is where you have direct comparison.

EXISTS clause

But there are complications when you look at the EXISTS clause. EXISTS are typically correllated, where the subquery is filtered by criteria from the table(s) outside the subquery. But the EXISTS is only for returning a boolean based on the criteria.

 FROM TABLE_X xWHERE EXISTS (SELECT NULL                FROM TABLE_Y y               WHERE y.col = x.col)

...converted:

FROM TABLE_X xJOIN TABLE_Y y ON y.col = x.col

Because of the boolean, there's a risk of more rows turning up in the resultset.

SELECTs in the SELECT clause

These should always be changed, with prejudice:

SELECT x.*,       (SELECT MAX(y.example_col)          FROM TABLE_Y y         WHERE y.col = x.col)  FROM TABLE_X x

You're probably noticing a patter now, but I made this a little different for an inline view example:

SELECT x.*,       z.mc  FROM TABLE_X x  JOIN (SELECT y.col, --inline view within the brackets               MAX(y.example_col) 'mc'          FROM TABLE_Y y      GROUP BY y.col) z ON z.col = x.col

The key is making sure the inline view resultset includes the column(s) needed to join to, along with the columns.

LEFT JOINs

You might've noticed I didn't have any LEFT JOIN examples - this would only be necessary if columns from the subquery use NULL testing (COALESCE on almost any db these days, Oracle's NVL or NVL2, MySQLs IFNULL, SQL Server's ISNULL, etc...):

SELECT x.*,       COALESCE((SELECT MAX(y.example_col)          FROM TABLE_Y y         WHERE y.col = x.col), 0)  FROM TABLE_X x

Converted:

   SELECT x.*,          COALESCE(z.mc, 0)     FROM TABLE_X xLEFT JOIN (SELECT y.col,                  MAX(y.example_col) 'mc'             FROM TABLE_Y y         GROUP BY y.col) z ON z.col = x.col

Conclusion

I'm not sure if that will satisfy your typographic needs, but hope I've demonstrated that the key is determining what the JOIN criteria is. Once you know the column(s) involved, you know the table(s) involved.


This question relies on a basic knowledge of Relational Algebra. You need to ask yourself what kind of join is being performed. For example, an LEFT ANTI SEMI JOIN is like a WHERE NOT EXISTS clause.

Some joins do not allow duplicating of data, some do not allow eliminating data. Others allow extra fields to be available. I discuss this in my blog at http://msmvps.com/blogs/robfarley/archive/2008/11/09/join-simplification-in-sql-server.aspx

Also, please don't feel you need to do everything in JOINs. The Query Optimizer should take care of all of this for you, and you can often make your queries much harder to maintain this way. You may find yourself using an extensive GROUP BY clause, and having interesting WHERE .. IS NULL filters, which will only serve to disconnect the business logic from the query design.

A subquery in the SELECT clause (essentially a lookup) only provides an extra field, not duplication or elimination. Therefore, you would need to make sure that you enforce GROUP BY or DISTINCT values in your JOIN, and use an OUTER JOIN to guarantee that behaviour is the same.

A subquery in the WHERE clause can never duplicate data, or provide extra columns to the SELECT clause, so you should use GROUP BY / DISTINCT to check this. WHERE EXISTS is similar. (This the LEFT SEMI JOIN)

WHERE NOT EXISTS (LEFT ANTI SEMI JOIN) doesn't provide data, and doesn't duplicate rows, but can eliminate... for this you need to do LEFT JOINs and look for NULLs.

But the Query Optimizer should handle all this for you. I actually like having occasional subqueries in the SELECT clause, because it makes it very clear that I am not duplicating or eliminating rows. The QO can tidy it for me, but if I'm using a view or inline table-valued function, I want to make it clear to those who come after me that the QO can simplify it down a lot. Have a look at the Execution Plans of your original query, and you'll see that the system is providing the INNER/OUTER/SEMI joins for you.

The thing you really need to be avoiding (at least in SQL Server) are functions that use BEGIN and END (such as Scalar Functions). They may feel like they simplify your code, but they will actually be executed in a separate context, as the system sees them as procedural (not simplifiable).

I did a session on this kind of thing at the recent SQLBits V conference. It was recorded, so you should be able to watch it at some point (if you can put up with my jokes!)


It often is possible, and what's good is that the query optimizer can do it automatically, so you don't have to care about it.