SQL Performance: SELECT DISTINCT versus GROUP BY SQL Performance: SELECT DISTINCT versus GROUP BY oracle oracle

SQL Performance: SELECT DISTINCT versus GROUP BY


The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by, it would execute once after the group by.

Try replacing it with a join, instead:

select . . .,       parentcntfrom . . . left outer join      (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt       FROM ITEM_PARENTS       ) p      on items.item_id = p.parent_item_id


I'm fairly sure that GROUP BY and DISTINCT have roughly the same execution plan.

The difference here since we have to guess (since we don't have the explain plans) is IMO that the inline subquery gets executed AFTER the GROUP BY but BEFORE the DISTINCT.

So if your query returns 1M rows and gets aggregated to 1k rows:

  • The GROUP BY query would have run the subquery 1000 times,
  • Whereas the DISTINCT query would have run the subquery 1000000 times.

The tkprof explain plan would help demonstrate this hypothesis.


While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPE with a value of "TYPE A".

IMO your query would have a better plan and would be more easily readable if written like this:

SELECT ITEMS.ITEM_ID,       ITEMS.ITEM_CODE,       ITEMS.ITEMTYPE,       ITEM_TRANSACTIONS.STATUS,       (SELECT COUNT(PKID)           FROM ITEM_PARENTS          WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT  FROM ITEMS  JOIN ITEM_TRANSACTIONS     ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID    AND ITEM_TRANSACTIONS.FLAG = 1 WHERE EXISTS (SELECT NULL                 FROM JOB_INVENTORY                    JOIN TASK_INVENTORY_STEP                    ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID                WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'                  AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)

In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).

Note also that 4 tables are not used in your original select.


The first thing that should be noted is the use of Distinct indicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group by is faster (without seeing the query), is that the location of the group by reduces the number of records that end up being returned. Whereas distinct is blowing out the result set and doing row by row comparisons.

Update to approach

Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.

I do believe this is the exact reason to use a materialized view. So the process would work this way. You take the long running query as the piece that builds out your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. So what you want to do is query against this base materialized view, which can be refreshed constantly on the back-end, the persistence strategy involved should not choke out the materialized view (persisting a few hundred records at a time won't crush anything). What this will allow is Oracle to grab a read lock (note we don't care how many sources read our data, we only care about writers). In the worst case a user will have "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle eyed users.

Code example of how to do this:

create materialized view dept_mv FOR UPDATE as select * from dept; 

Now the key to this is as long as you don' t invoke refresh you won't lose any of the persisted data. It will be up to you to determine when you want to "base line" your materialized view again (midnight perhaps?)