How to optimize count and order by query in millions of row How to optimize count and order by query in millions of row database database

How to optimize count and order by query in millions of row


In query #1, but not the other two, the optimizer can use

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

to cut the query short for

GROUP BY c.shopify_customer_idLIMIT 20

This is because it can stop after 20 items of the index. The query is not ultra-fast because of the derived table (subquery t) that hits about 51K rows.

Query #2 may be slow simply because the Optimizer failed to notice and remove the redundant DISTINCT. Instead, it may be thinking it can't stop after 20.

Query #3 must go entirely through table c to get every shopify_customer_id group. This is because the ORDER BY prevents a short curcuit to get to the LIMIT 20.

The columns in a GROUP BY must include all the non-aggregate columns in the SELECT except any that are uniquely defined by the group by columns. Since you have said that there can be multiple addresses for a single shopify_customer_id, then fetching ca.address1 is not proper in connection with GROUP BY shopify_customer_id. Similarly, the subquery seems to be improper with respect to last_seen_date, last_contact_date.

In aio_customer_tracking, this change (to a "covering" index) may help a little:

INDEX (`shopify_customer_id`)

to

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)

Dissecting the goal

Now, simply I want to ... count the customers

To count the customers, do this, but don't try to combine it with "fetching":

SELECT COUNT(*) FROM tbl_customers;

Now, simply I want to fetch ... the customers...

tbl_customers - #Rows: 20 million.

Surely you don't want to fetch 20 million rows! I don't want to think about how to try to do that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE clause?? The WHERE clause is (usually) the most important part of Optimization!

Now, simply I want to fetch ... the customers, with their one of the address, and visiting info.

Assuming that the WHERE filters down to a "few" customers, then JOINing to another table to get "any" address and "any" visiting info, may be problematical and/or inefficient. To require the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.

May I suggest that your UI first find a few customers, then if the user wants, go to another page with all the addresses and all the visits. Or can the visits be in the hundreds or more?

Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order).

Let's focus on optimizing the WHERE, then tack last_seen_date on the end of any index.


shopify_customer_id is unique in tbl_customers table, then in 2nd query why you use distinct and group by in shopify_customer_id column?

Please get rid of that.


Query 2 contains a logical mistake as pointed out by others: the count(distinct(c.shopify_customer_id)) will return a single value, therefore your group by is only complicating the query (this might indeed make MySQL grouping by shopify_customer_id first and then executing the count(distinct(shopify_customer_id )) which could be the reason for the somehow long execution time

The order by of Query 3 can not be optimized as you are joining on a subselect which cannot be indexed. The time it takes is simply the time the system needs to order the result set.

The solution to your problem would be to:

  1. change the index shopify_customer_id (shopify_customer_id) of table tbl_customers_address to shopify_customer_id (shopify_customer_id,default) to optimize the following query

  2. create a table with the result from Query 1 (result) but without

    LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id.

  3. alter the result table and add a column for last_seen_date and indexes for last_seen_date and shopify_customer_id

  4. create a table for the result of this query (last_Date) :

SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id

  1. Update the result table with the values from table last_Date

Now you can run a query against the result table ordered by last_Date using the index you created.

The whole process should take way less time than executing query 2 or query 3