How to optimize count and order by query in millions of row
In query #1, but not the other two, the optimizer can use
UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)
to cut the query short for
GROUP BY c.shopify_customer_idLIMIT 20
This is because it can stop after 20 items of the index. The query is not ultra-fast because of the derived table (subquery t
) that hits about 51K rows.
Query #2 may be slow simply because the Optimizer failed to notice and remove the redundant DISTINCT
. Instead, it may be thinking it can't stop after 20.
Query #3 must go entirely through table c
to get every shopify_customer_id
group. This is because the ORDER BY
prevents a short curcuit to get to the LIMIT 20
.
The columns in a GROUP BY
must include all the non-aggregate columns in the SELECT
except any that are uniquely defined by the group by columns. Since you have said that there can be multiple addresses for a single shopify_customer_id
, then fetching ca.address1
is not proper in connection with GROUP BY shopify_customer_id
. Similarly, the subquery seems to be improper with respect to last_seen_date, last_contact_date
.
In aio_customer_tracking
, this change (to a "covering" index) may help a little:
INDEX (`shopify_customer_id`)
to
INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)
Dissecting the goal
Now, simply I want to ... count the customers
To count the customers, do this, but don't try to combine it with "fetching":
SELECT COUNT(*) FROM tbl_customers;
Now, simply I want to fetch ... the customers...
tbl_customers - #Rows: 20 million.
Surely you don't want to fetch 20 million rows! I don't want to think about how to try to do that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE
clause?? The WHERE
clause is (usually) the most important part of Optimization!
Now, simply I want to fetch ... the customers, with their one of the address, and visiting info.
Assuming that the WHERE
filters down to a "few" customers, then JOINing
to another table to get "any" address and "any" visiting info, may be problematical and/or inefficient. To require the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.
May I suggest that your UI first find a few customers, then if the user wants, go to another page with all the addresses and all the visits. Or can the visits be in the hundreds or more?
Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order).
Let's focus on optimizing the WHERE
, then tack last_seen_date
on the end of any index.
shopify_customer_id
is unique in tbl_customers
table, then in 2nd query why you use distinct and group by in shopify_customer_id
column?
Please get rid of that.
Query 2 contains a logical mistake as pointed out by others: the count(distinct(c.shopify_customer_id))
will return a single value, therefore your group by is only complicating the query (this might indeed make MySQL grouping by shopify_customer_id first and then executing the count(distinct(shopify_customer_id ))
which could be the reason for the somehow long execution time
The order by of Query 3 can not be optimized as you are joining on a subselect which cannot be indexed. The time it takes is simply the time the system needs to order the result set.
The solution to your problem would be to:
change the index
shopify_customer_id
(shopify_customer_id
) of table tbl_customers_address toshopify_customer_id
(shopify_customer_id
,default
) to optimize the following querycreate a table with the result from Query 1 (result) but without
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
.alter the result table and add a column for last_seen_date and indexes for last_seen_date and shopify_customer_id
create a table for the result of this query (last_Date) :
SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id
- Update the result table with the values from table last_Date
Now you can run a query against the result table ordered by last_Date using the index you created.
The whole process should take way less time than executing query 2 or query 3