Sqlite subselect much faster than distinct + order by
Your first query orders the records first by inserting all of them into a sorted temporary table, and then implements the DISTINCT
by going through them and returning only those that are not identical to the previous one.(This can be seen in the EXPLAIN
output shown below; the DISTINCT
actually got converted to a GROUP BY
, which behaves the same.)
Your second query is, in theory, identical to the first, but SQLite's query optimizer is rather simple and cannot prove that this conversion would be safe (as explained in the subquery flattening documentation).Therefore, it is implemented by doing the DISTINCT
first, by inserting only any non-duplicates into a temporary table, and then doing the ORDER BY
with a second temporary table.This second step is completely superfluous because the first temp table was already sorted, but this happens to be faster for your data anyway because you have so many duplicates that are never stored in either temp table.
In theory, your first query could be faster, because SQLite has already recognized that the DISTINCT
and ORDER BY
clauses can be implemented with the same sorted temporary table.In practice, however, SQLite it is not smart enough to remember that the DISTINCT
implies that duplicates do not need to be stored in the temp table.(This particular optimization might be added to SQLite if you ask nicely on the mailing list.)
$ sqlite3 mydb sqlite> .explainsqlite> explain SELECT DISTINCT acolumn FROM atable ORDER BY acolumn;addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- -------------0 Trace 0 0 0 00 1 SorterOpen 1 2 0 keyinfo(1,BINARY) 00 2 Integer 0 3 0 00 clear abort flag3 Integer 0 2 0 00 indicate accumulator empty4 Null 0 6 6 00 5 Gosub 5 37 0 00 6 Goto 0 40 0 00 7 OpenRead 0 2 0 1 00 atable 8 Rewind 0 14 0 00 9 Column 0 0 8 00 atable.acolumn10 Sequence 1 9 0 00 11 MakeRecord 8 2 10 00 12 SorterInsert 1 10 0 00 13 Next 0 9 0 01 14 Close 0 0 0 00 15 OpenPseudo 2 10 2 00 16 SorterSort 1 39 0 00 GROUP BY sort17 SorterData 1 10 0 00 18 Column 2 0 7 20 19 Compare 6 7 1 keyinfo(1,BINARY) 00 20 Jump 21 25 21 00 21 Move 7 6 0 00 22 Gosub 4 32 0 00 output one row23 IfPos 3 39 0 00 check abort flag24 Gosub 5 37 0 00 reset accumulator25 Column 2 0 1 00 26 Integer 1 2 0 00 indicate data in accumulator27 SorterNext 1 17 0 00 28 Gosub 4 32 0 00 output final row29 Goto 0 39 0 00 30 Integer 1 3 0 00 set abort flag31 Return 4 0 0 00 32 IfPos 2 34 0 00 Groupby result generator entry point33 Return 4 0 0 00 34 Copy 1 11 0 00 35 ResultRow 11 1 0 00 36 Return 4 0 0 00 end groupby result generator37 Null 0 1 0 00 38 Return 5 0 0 00 39 Halt 0 0 0 00 40 Transaction 0 0 0 00 41 VerifyCookie 0 2 0 00 42 TableLock 0 2 0 atable 00 43 Goto 0 7 0 00
sqlite> explain SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable) ORDER BY acolumn;addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- -------------0 Trace 0 0 0 00 1 Goto 0 39 0 00 2 Goto 0 17 0 00 3 OpenPseudo 0 3 1 01 coroutine for sqlite_subquery_DA7480_4 Integer 0 2 0 01 5 OpenEphemeral 2 0 0 keyinfo(1,BINARY) 08 6 OpenRead 1 2 0 1 00 atable 7 Rewind 1 14 0 00 8 Column 1 0 3 00 atable.acolumn9 Found 2 13 3 1 00 10 MakeRecord 3 1 4 00 11 IdxInsert 2 4 0 00 12 Yield 1 0 0 00 13 Next 1 8 0 01 14 Close 1 0 0 00 15 Integer 1 2 0 00 16 Yield 1 0 0 00 end sqlite_subquery_DA7480_17 SorterOpen 3 3 0 keyinfo(1,BINARY) 00 18 Integer 2 1 0 00 19 Yield 1 0 0 00 next row of co-routine sqlite_subquery_DA7480_20 If 2 29 0 00 21 Column 0 0 5 00 sqlite_subquery_DA7480_.acolumn22 MakeRecord 5 1 6 00 23 Column 0 0 7 00 sqlite_subquery_DA7480_.acolumn24 Sequence 3 8 0 00 25 Move 6 9 0 00 26 MakeRecord 7 3 10 00 27 SorterInsert 3 10 0 00 28 Goto 0 19 0 00 29 OpenPseudo 4 6 1 00 30 OpenPseudo 5 11 3 00 31 SorterSort 3 37 0 00 32 SorterData 3 11 0 00 33 Column 5 2 6 20 34 Column 4 0 5 20 35 ResultRow 5 1 0 00 36 SorterNext 3 32 0 00 37 Close 4 0 0 00 38 Halt 0 0 0 00 39 Transaction 0 0 0 00 40 VerifyCookie 0 2 0 00 41 TableLock 0 2 0 atable 00 42 Goto 0 2 0 00
Inside most DBMS, SQL statements are translated into relational algebra and then structured in an expression tree.The dbms then uses heuristics to optimise queries. One of the main heuristics is "Perform selection early" (p.46). I suppose the sqlite query planner does this as well, hence the differences in execution time.
Since the result of the subquery is much smaller (~50 rows opposed to 4.5 million), sorting, at the end of the expression tree, happens much faster. (Plain) Selecting isn't a very expensive process, running operations on a multitude of results is indeed.
I believe this must be because the order operation and distinct operations must be implemented more efficiently when separated by the subselect - which is effectively a simpler way to say way alexdeloy is saying.
This experiment is not complete. Please also run the following:
% echo "SELECT acolumn FROM (SELECT DISTINCT acolumn FROM atable ORDER BY acolumn) ;" | time sqlite3 mydb
Tell me if this takes longer than the other two on average and thanks.