SQL Performance: SELECT DISTINCT versus GROUP BY

sql performance oracle group-by distinct

The performance difference is probably due to the execution of the subquery in the SELECT clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by, it would execute once after the group by.

Try replacing it with a join, instead:

select . . .,       parentcntfrom . . . left outer join      (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt       FROM ITEM_PARENTS       ) p      on items.item_id = p.parent_item_id

sql performance oracle group-by distinct

I'm fairly sure that GROUP BY and DISTINCT have roughly the same execution plan.

The difference here since we have to guess (since we don't have the explain plans) is IMO that the inline subquery gets executed AFTER the GROUP BY but BEFORE the DISTINCT.

So if your query returns 1M rows and gets aggregated to 1k rows:

The GROUP BY query would have run the subquery 1000 times,
Whereas the DISTINCT query would have run the subquery 1000000 times.

The tkprof explain plan would help demonstrate this hypothesis.

While we're discussing this, I think it's important to note that the way the query is written is misleading both to the reader and to the optimizer: you obviously want to find all rows from item/item_transactions that have a TASK_INVENTORY_STEP.STEP_TYPE with a value of "TYPE A".

IMO your query would have a better plan and would be more easily readable if written like this:

SELECT ITEMS.ITEM_ID,       ITEMS.ITEM_CODE,       ITEMS.ITEMTYPE,       ITEM_TRANSACTIONS.STATUS,       (SELECT COUNT(PKID)           FROM ITEM_PARENTS          WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID) AS CHILD_COUNT  FROM ITEMS  JOIN ITEM_TRANSACTIONS     ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID    AND ITEM_TRANSACTIONS.FLAG = 1 WHERE EXISTS (SELECT NULL                 FROM JOB_INVENTORY                    JOIN TASK_INVENTORY_STEP                    ON JOB_INVENTORY.JOB_ITEM_ID=TASK_INVENTORY_STEP.JOB_ITEM_ID                WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'                  AND ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID)

In many cases, a DISTINCT can be a sign that the query is not written properly (because a good query shouldn't return duplicates).

Note also that 4 tables are not used in your original select.

sql performance oracle group-by distinct

The first thing that should be noted is the use of Distinct indicates a code smell, aka anti-pattern. It generally means that there is a missing join or an extra join that is generating duplicate data. Looking at your query above, I am guessing that the reason why group by is faster (without seeing the query), is that the location of the group by reduces the number of records that end up being returned. Whereas distinct is blowing out the result set and doing row by row comparisons.

Update to approach

Sorry, I should have been more clear. Records are generated when users perform certain tasks in the system, so there is no schedule. A user could generate a single record in a day or hundreds per-hour. The important things is that each time a user runs a search, up-to-date records must be returned, which makes me doubtful that a materialized view would work here, especially if the query populating it would take long to run.

I do believe this is the exact reason to use a materialized view. So the process would work this way. You take the long running query as the piece that builds out your materialized view, since we know the user only cares about "new" data after they perform some arbitrary task in the system. So what you want to do is query against this base materialized view, which can be refreshed constantly on the back-end, the persistence strategy involved should not choke out the materialized view (persisting a few hundred records at a time won't crush anything). What this will allow is Oracle to grab a read lock (note we don't care how many sources read our data, we only care about writers). In the worst case a user will have "stale" data for microseconds, so unless this is a financial trading system on Wall Street or a system for a nuclear reactor, these "blips" should go unnoticed by even the most eagle eyed users.

Code example of how to do this:

create materialized view dept_mv FOR UPDATE as select * from dept;

Now the key to this is as long as you don' t invoke refresh you won't lose any of the persisted data. It will be up to you to determine when you want to "base line" your materialized view again (midnight perhaps?)

CodeHunter

SQL Performance: SELECT DISTINCT versus GROUP BY

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last