
SQLite3 query optimization join vs subselect


TL;DR: The best query and index are:

create index uniqueFiles on resume_points (scan_file_id);

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1;

Since I typically work with SQL Server, at first I thought that surely the query optimizer would find the optimal execution plan for such a simple query regardless of which way you write these equivalent SQL statements. So I downloaded SQLite, and started playing around. Much to my surprise, there was a huge difference in performance.
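If you want to reproduce measurements like these, the stock sqlite3 command-line shell can time each statement for you (a minimal sketch; substitute any of the queries below):

.timer on
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
-- the shell prints a "Run Time:" line (real/user/sys) after each statement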

Here's the setup code:

CREATE TABLE files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    dirty INTEGER NOT NULL
);
CREATE TABLE resume_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    scan_file_id INTEGER NOT NULL
);

insert into files (dirty) values (0);

-- each of the following doubles files; 23 doublings of the seed row give 2^23 = 8,388,608 rows
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
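As a quick sanity check on the generated data (a sketch; the expected counts follow from one seed row doubled 23 times and two 5000-row inserts):

select count(*) from files;                              -- expect 8388608 (2^23)
select count(*) from resume_points;                      -- expect 10000
select count(distinct scan_file_id) from resume_points;  -- 2 here, since each insert's scalar subquery is evaluated once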

I considered three indices:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);

Below are the queries I tried and their execution times on my i5 laptop. The database file is only about 200 MB, since it contains no other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id = files.id where files.dirty = 1;
    4.3 - 4.5 ms with and without index

select distinct files.* from files inner join resume_points on files.id = resume_points.scan_file_id where files.dirty = 1;
    4.4 - 4.7 ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
    2.0 - 2.5 ms with uniqueFiles; 2.6 - 2.9 ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
    2.1 - 2.5 ms with uniqueFiles; 2.6 - 3 ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f ON rp.scan_file_id = f.id WHERE f.dirty = 1 GROUP BY f.id;
    4500 - 6190 ms with uniqueFiles; 8.8 - 9.5 ms without uniqueFiles; 14000 ms with uniqueFiles and fileLookup

select * from files where exists (select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
    8400 ms with uniqueFiles; 7400 ms without uniqueFiles

It looks like SQLite's query optimizer isn't very advanced at all. The best queries first reduce resume_points to a small number of rows (two in the test case, most likely because the uncorrelated scalar subquery in the setup code is evaluated once per insert statement, so each 5000-row insert writes a single value; the OP said it would be 1-2 in practice), and then look up each file to see whether it is dirty. The dirtyFiles index didn't make much of a difference for any of the queries; I suspect that is because of the way the data is arranged in the test tables, and it may still matter in production tables. Even so, the difference can't be large, since there are fewer than a handful of lookups either way. uniqueFiles does make a difference, because it can reduce the 10000 rows of resume_points to 2 without scanning most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly; notably, it made the GROUP BY query much slower. In conclusion: reduce the result set as early as possible to make the biggest difference.
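To see why the plans differ without timing anything, EXPLAIN QUERY PLAN is useful (the output format varies across SQLite versions, so treat the details as version-dependent):

explain query plan
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
-- with uniqueFiles in place, the distinct subquery should be satisfied
-- from the covering index rather than by scanning resume_points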


Since files.id is the primary key, try GROUPing BY that field rather than selecting DISTINCT files.*:

SELECT f.*
FROM resume_points rp
INNER JOIN files f ON rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id;

Another option to consider for performance is adding an index to resume_points.scan_file_id.

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)


You could try exists, which will not produce any duplicate files:

select * from files
where exists (
    select * from resume_points
    where files.id = resume_points.scan_file_id
)
and dirty = 1;

Of course it might help to have the proper indexes:

files.dirty
resume_points.scan_file_id
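For instance (a minimal sketch; the first index name is a placeholder, and the second index is the same one suggested above):

create index idx_files_dirty on files (dirty);
create index index_resume_points_scan_file_id on resume_points (scan_file_id);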

Whether an index is helpful will depend on your data.