Very slow select query, how can I speed this up?


You are correlating two big tables. Some quick math: 300k x 20k = 6 billion rows. If it were just a matter of returning all those rows, you would surely be I/O bound (but really only on the (O)utput side). However, your WHERE clause filters out almost everything: only about 10k rows are returned, so you are definitely CPU bound here.

SQLite can't use more than one index at a time, with the exception of what is called the "OR optimization". Furthermore, you don't get any performance gain from inner joins, as they "are converted into additional terms of the WHERE clause".
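
For example, the two statements below should produce identical query plans. This is only a minimal sketch using the standard sqlite3 module and a stripped-down two-table schema, not your actual code:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE feature (lnk_feature INT, rtMin DOUBLE, rtMax DOUBLE);
    CREATE TABLE spectrum (lnk_spectrum INT, rt DOUBLE);
""")

# Explicit INNER JOIN ... ON syntax.
join_form = """EXPLAIN QUERY PLAN
    SELECT * FROM spectrum INNER JOIN feature ON lnk_feature = lnk_spectrum
    WHERE rt >= rtMin AND rt <= rtMax"""

# Comma join with the join condition moved into the WHERE clause.
where_form = """EXPLAIN QUERY PLAN
    SELECT * FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum AND rt >= rtMin AND rt <= rtMax"""

print(con.execute(join_form).fetchall())   # both calls should print
print(con.execute(where_form).fetchall())  # the same plan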

The bottom line is that SQLite will not be able to execute your query as efficiently as, say, PostgreSQL and the like.

I played around with your scenario as I was curious to see by how much your query could be optimized. Ultimately, it seems that the best optimization is to remove all explicit indices (!). SQLite appears to build an automatic, on-the-fly index that results in better performance than any of the approaches I tried.

As a demonstration, consider this schema derived from yours:

CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE,
    mzMax DOUBLE,
    rtMin DOUBLE,
    rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE,
    rt DOUBLE,
    lnk_spectrum INT);

feature has 300k rows and spectrum 20k (the Python code that builds them is further down). There is no explicit index specified, only the implicit ones due to the INTEGER PRIMARY KEY definitions:

INTEGER PRIMARY KEY columns aside, both UNIQUE and PRIMARY KEY constraints are implemented by creating an index in the database (in the same way as a "CREATE UNIQUE INDEX" statement would). Such an index is used like any other index in the database to optimize queries. As a result, there is often no advantage (but significant overhead) in creating an index on a set of columns that are already collectively subject to a UNIQUE or PRIMARY KEY constraint.

Using the schema above, SQLite mentions that it would create an automatic index on lnk_feature for the duration of the query:

sqlite> EXPLAIN QUERY PLAN SELECT feature_id, spectrum_id FROM spectrum, feature
   ...> WHERE lnk_feature = lnk_spectrum
   ...>     AND rt >= rtMin AND rt <= rtMax
   ...>     AND mz >= mzMin AND mz <= mzMax;
0|0|0|SCAN TABLE spectrum (~20000 rows)
0|1|1|SEARCH TABLE feature USING AUTOMATIC COVERING INDEX (lnk_feature=?) (~7 rows)

And even though I also tested with an explicit index on that column, and on other columns, it seems that the fastest way to run the query is without any of those indices.
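
If you want to reproduce that comparison, here is a rough sketch of it (it uses the standard sqlite3 module rather than apsw, assumes the foo.db file built by the script further down, and the index name is made up):

import sqlite3, time

SQL = """SELECT feature_id, spectrum_id FROM spectrum, feature
         WHERE lnk_feature = lnk_spectrum
             AND rt >= rtMin AND rt <= rtMax
             AND mz >= mzMin AND mz <= mzMax"""

def timed(con):
    # Run the query, force full materialization, and report rows/seconds.
    start = time.time()
    rows = con.execute(SQL).fetchall()
    return len(rows), time.time() - start

con = sqlite3.connect("foo.db")
print("automatic index only:", timed(con))

con.execute("CREATE INDEX idx_lnk_feature ON feature (lnk_feature)")
con.execute("ANALYZE")
con.commit()
print("explicit index:", timed(con))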

The fastest I could run the query above from Python was about 20 minutes, and that includes the completion of .fetchall(). You mention that you will have 150 times more rows at some point; I'd start looking into PostgreSQL if I were you ;-)... Note that you could split the work across workers and potentially divide the time to complete the query by the number of workers that can run concurrently (i.e. by the number of CPUs available), as sketched below.
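
A rough sketch of that splitting idea, using one worker process per partition rather than threads (CPython threads won't help much with CPU-bound work). It partitions spectrum by spectrum_id modulo the number of workers and assumes the foo.db file produced by the script below:

import sqlite3
from multiprocessing import Pool

N = 4  # roughly the number of available CPUs

def work(part):
    # Each worker opens its own connection and scans only its slice of spectrum.
    con = sqlite3.connect("foo.db")
    sql = """SELECT feature_id, spectrum_id FROM spectrum, feature
             WHERE spectrum_id % ? = ?
                 AND lnk_feature = lnk_spectrum
                 AND rt >= rtMin AND rt <= rtMax
                 AND mz >= mzMin AND mz <= mzMax"""
    return con.execute(sql, (N, part)).fetchall()

if __name__ == "__main__":
    with Pool(N) as pool:
        chunks = pool.map(work, range(N))
    print(sum(len(c) for c in chunks), "rows")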

In any case, here is the code I used. Can you run it yourself and report back how fast the query runs in your environment? Note that I am using apsw, so if you can't use it you'll need to tweak the code to use your own sqlite3 module.

#!/usr/bin/python
import apsw, random as rand, time

def populate(cu):
    cu.execute("""
CREATE TABLE feature ( -- 300k
    feature_id INTEGER PRIMARY KEY,
    mzMin DOUBLE, mzMax DOUBLE,
    rtMin DOUBLE, rtMax DOUBLE,
    lnk_feature INT);
CREATE TABLE spectrum ( -- 20k
    spectrum_id INTEGER PRIMARY KEY,
    mz DOUBLE, rt DOUBLE,
    lnk_spectrum INT);""")
    cu.execute("BEGIN")
    for i in range(300000):
        ((mzMin, mzMax), (rtMin, rtMax)) = (get_min_max(), get_min_max())
        cu.execute("INSERT INTO feature VALUES (NULL,%s,%s,%s,%s,%s)"
                   % (mzMin, mzMax, rtMin, rtMax, get_lnk()))
    for i in range(20000):
        cu.execute("INSERT INTO spectrum VALUES (NULL,%s,%s,%s)"
                   % (get_in_between(), get_in_between(), get_lnk()))
    cu.execute("COMMIT")
    cu.execute("ANALYZE")

def get_lnk():
    return rand.randint(1, 2)

def get_min_max():
    return sorted((rand.normalvariate(0.5, 0.004),
                   rand.normalvariate(0.5, 0.004)))

def get_in_between():
    return rand.normalvariate(0.5, 0.49)

def select(cu):
    sql = """
    SELECT feature_id, spectrum_id FROM spectrum, feature
    WHERE lnk_feature = lnk_spectrum
        AND rt >= rtMin AND rt <= rtMax
        AND mz >= mzMin AND mz <= mzMax"""
    start = time.time()
    cu.execute(sql)
    print("%s rows; %.2f seconds" % (len(cu.fetchall()), time.time() - start))

cu = apsw.Connection('foo.db').cursor()
populate(cu)
select(cu)

Output I get:

54626 rows; 1210.96 seconds


Make it better on the SQL part.

In one word, use INDEXES!


Use BETWEEN instead of >= and <= for range comparisons.

self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time between feature.rtMin "+
                    "AND feature.rtMax "+
                    "AND spectrum.base_peak_mz between feature.mzMin "+
                    "AND feature.mzMax")

You can create non-clustered indexes on the spectrum.scan_start_time, feature.rtMin, feature.rtMax, spectrum.base_peak_mz, feature.mzMin and feature.mzMax fields.
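
For what it's worth, SQLite has no clustered/non-clustered distinction, so a plain CREATE INDEX is the closest equivalent. A sketch of that suggestion, one index per column listed above (the index names and the database path are placeholders):

import sqlite3

con = sqlite3.connect("yourdatabase.db")  # placeholder path
con.executescript("""
    CREATE INDEX IF NOT EXISTS idx_spectrum_scan_start_time ON spectrum (scan_start_time);
    CREATE INDEX IF NOT EXISTS idx_spectrum_base_peak_mz    ON spectrum (base_peak_mz);
    CREATE INDEX IF NOT EXISTS idx_feature_rtMin            ON feature (rtMin);
    CREATE INDEX IF NOT EXISTS idx_feature_rtMax            ON feature (rtMax);
    CREATE INDEX IF NOT EXISTS idx_feature_mzMin            ON feature (mzMin);
    CREATE INDEX IF NOT EXISTS idx_feature_mzMax            ON feature (mzMax);
    ANALYZE;
""")
con.commit()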