
Efficient way to ensure unique rows in SQLite3


I have used SQLite to insert millions of rows at runtime, and this is what I did to increase performance (a sketch combining these points follows the list):

  • Use as few transactions as possible.
  • Use parameterized commands for inserting the data (prepare the command once and just change the parameter values in the loop).
  • Set PRAGMA synchronous = OFF (not sure how it interacts with WAL).
  • Increase the page size of the database.
  • Increase the cache size. This is an important setting, as it causes SQLite to write the data to disk fewer times, running more operations in memory and making the whole process faster.
  • If you need an index, add it after inserting the rows by running the necessary SQLite command. In this case you will need to ensure uniqueness yourself, as you are doing now.
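
Roughly, with the sqlite3 C API, the whole thing might look like the sketch below. The Event table and Fld0-Fld3 column names are borrowed from the statements quoted later in this thread; everything else (index name, values, cache size) is made up, and error handling is mostly omitted:

    /* Sketch only: assumes an Event(Fld0, Fld1, Fld2, Fld3) table as in the
       statements quoted further down; error handling is abbreviated. */
    #include <sqlite3.h>
    #include <stdio.h>

    int bulk_insert(sqlite3 *db, int nrows)
    {
        /* Trade durability for speed and enlarge the page/cache sizes. */
        sqlite3_exec(db, "PRAGMA synchronous = OFF;", 0, 0, 0);
        sqlite3_exec(db, "PRAGMA page_size = 8192;", 0, 0, 0);   /* only effective on a new file */
        sqlite3_exec(db, "PRAGMA cache_size = 10000;", 0, 0, 0); /* pages kept in memory */

        /* One transaction around the whole batch instead of one per row. */
        sqlite3_exec(db, "BEGIN;", 0, 0, 0);

        sqlite3_stmt *ins = NULL;
        sqlite3_prepare_v2(db,
            "INSERT INTO Event (Fld0, Fld1, Fld2, Fld3) VALUES (?, ?, ?, ?);",
            -1, &ins, NULL);

        for (int i = 0; i < nrows; i++) {
            /* The statement is prepared once; only the bound values change. */
            sqlite3_bind_int (ins, 1, i);
            sqlite3_bind_int (ins, 2, i * 2);
            sqlite3_bind_text(ins, 3, "some text", -1, SQLITE_STATIC);
            sqlite3_bind_text(ins, 4, "more text", -1, SQLITE_STATIC);
            if (sqlite3_step(ins) != SQLITE_DONE)
                fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            sqlite3_reset(ins);
        }
        sqlite3_finalize(ins);
        sqlite3_exec(db, "COMMIT;", 0, 0, 0);

        /* Build the index only after the bulk load. */
        sqlite3_exec(db,
            "CREATE INDEX IF NOT EXISTS Idx_Event ON Event (Fld0, Fld2, Fld3);",
            0, 0, 0);
        return 0;
    }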

If you try these please post your test results. I believe it will be interesting for everyone.


The ON CONFLICT REPLACE clause will make SQLite delete existing rows, then insert new rows. That means that SQLite is probably going to spend some of its time

  • deleting existing rows
  • updating the indexes
  • inserting new rows
  • updating the indexes

That's my take on it, based on SQLite documentation and reading about other database management systems. I didn't look at the source code.

SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index, though.
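
For illustration only, here is a hypothetical schema along those lines; the column names and types are my assumptions, but it shows where the ON CONFLICT REPLACE clause sits and why an index comes along with the key:

    #include <sqlite3.h>

    /* Hypothetical schema: a composite key with ON CONFLICT REPLACE.  Either the
       PRIMARY KEY below or an equivalent UNIQUE constraint makes SQLite maintain
       an implicit index on (Fld0, Fld2, Fld3). */
    static const char *ddl =
        "CREATE TABLE IF NOT EXISTS Event ("
        "  Fld0 INTEGER, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT,"
        "  PRIMARY KEY (Fld0, Fld2, Fld3) ON CONFLICT REPLACE"
        ");";

    int create_schema(sqlite3 *db)
    {
        /* With ON CONFLICT REPLACE, an INSERT that collides on the key first
           deletes the conflicting row(s) and then inserts the new one, which is
           where the extra index maintenance described above comes from. */
        return sqlite3_exec(db, ddl, 0, 0, 0);
    }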

Now the really important stuff . . .

It's great that you did tests. Most developers don't do that. But I think your test results are badly misleading.

In your case, it doesn't matter how fast you can insert rows into a table that doesn't have a primary key. A table that doesn't have a primary key doesn't satisfy your basic requirements for data integrity. That means you can't rely on your database to give you the right answers.

If it doesn't have to give the right answers, I can make it really, really fast.

To get a meaningful timing for inserting into a table that has no key, you need to either

  • run code before inserting new data to make sure you don't violate the undeclared primary key constraint, and to make sure you update existing rows with the right values (instead of inserting), or
  • run code after inserting into that table to clean up duplicates on (Fld0, Fld2, Fld3), and to reconcile conflicts

And, of course, the time those processes take has to be taken into account, too.
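
For what it's worth, the second option could be a single cleanup statement after the load. This is only a sketch: keeping the most recently inserted row per key is just one possible way to reconcile conflicts, and the names are assumed from the thread.

    #include <sqlite3.h>

    /* Cleanup pass: after bulk-loading into an unkeyed Event table, keep only
       the newest row (highest rowid) per (Fld0, Fld2, Fld3) combination. */
    int dedupe_events(sqlite3 *db)
    {
        return sqlite3_exec(db,
            "DELETE FROM Event "
            "WHERE rowid NOT IN "
            "  (SELECT MAX(rowid) FROM Event GROUP BY Fld0, Fld2, Fld3);",
            0, 0, 0);
    }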

FWIW, I did a test by running 100K SQL insert statements into your schema in transactions of 1000 statements, and it only took 30 seconds. A single transaction of 1000 insert statements, which seems to be what you expect in production, took 149 msec.

Maybe you can speed things up by inserting into an unkeyed temporary table, then updating the keyed table from that.
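
A rough sketch of that idea, with an assumed staging table called EventStage and a "replace on conflict" policy (neither of which comes from the question):

    #include <sqlite3.h>

    /* Fold staged rows into the keyed Event table in one statement. */
    int merge_staged_rows(sqlite3 *db)
    {
        /* EventStage would be created earlier, e.g.:
             CREATE TEMP TABLE EventStage (Fld0 INTEGER, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT);
           and bulk-loaded without any key or index getting in the way. */
        return sqlite3_exec(db,
            "INSERT OR REPLACE INTO Event (Fld0, Fld1, Fld2, Fld3) "
            "  SELECT Fld0, Fld1, Fld2, Fld3 FROM EventStage; "
            "DELETE FROM EventStage;",
            0, 0, 0);
    }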


(I don't normally answer my own questions, but I would like to document a few ideas/partial solutions for this.)

The main problem with a composite primary key is the way the indexes are handled. Composite keys imply an index on the composite value, which in my case means indexing strings. While comparing string values isn't that slow, indexing a value with a length of, say, 500 bytes means that the B-tree nodes in the index can fit far fewer row/node pointers than a B-tree that indexes a 64-bit integer value. This means loading far more DB pages for each index search, as the height of the B-tree increases.

In order to deal with this issue I modified my code so that:

  • It uses WAL mode. The performance increase was certainly worth such a small change, since I do not have any issues with the DB file not being self-contained.

  • I used the MurmurHash3 hash function - after re-writing it in C and adapting it - to produce a single 32-bit hash value from the values of the fields that would form the key. I stored this hash in a new indexed column. Since this is an integer value, the index is quite fast. This is the only index for this table. Since there are going to be at most 10,000,000 rows in the table, hash collisions will not be a performance concern - although I cannot really consider the hash value to be UNIQUE, the index will only return a single row in the general case.
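
Roughly, the setup looks like the sketch below. The FNV-1a function there is only a stand-in for the adapted MurmurHash3 code, which is too long to reproduce here; the table and index names are illustrative.

    #include <sqlite3.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Placeholder 32-bit hash (FNV-1a), standing in for the MurmurHash3 port
       mentioned above; the point is only that the key fields collapse into one
       indexed integer column. */
    static uint32_t hash32(const char *s, uint32_t h)
    {
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    uint32_t key_hash(int fld0, const char *fld2, const char *fld3)
    {
        char num[16];
        uint32_t h = 2166136261u;                 /* FNV-1a offset basis */
        snprintf(num, sizeof num, "%d;", fld0);   /* ';' keeps fields separated */
        h = hash32(num, h);
        h = hash32(fld2, h);
        h = hash32(";", h);
        return hash32(fld3, h);
    }

    int prepare_hashed_schema(sqlite3 *db)
    {
        sqlite3_exec(db, "PRAGMA journal_mode = WAL;", 0, 0, 0);  /* WAL mode */
        /* Only the integer Hash column is indexed; the wide string fields are not. */
        return sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS Event "
            "  (Hash INTEGER, Fld0 INTEGER, Fld1 INTEGER, Fld2 TEXT, Fld3 TEXT); "
            "CREATE INDEX IF NOT EXISTS Idx_EventHash ON Event (Hash);",
            0, 0, 0);
    }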

At this point there are two alternatives that I have coded and are currently undergoing testing:

  • DELETE FROM Event WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT.

  • UPDATE Event SET Fld1=?,... WHERE Hash=? AND Fld0=? AND Fld2=? AND Fld3=?, followed by an INSERT if no rows were updated.

I expect the second alternative to be faster, but I will have to complete the testing first. In any case, it seems that with these changes the performance drop (compared to the original index-less table) has been reduced to a factor of 5 or so, which is far more manageable.
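
The second variation, sketched with sqlite3_changes() detecting the "no rows updated" case (the statement texts are shown in the comment and would be prepared once, outside the per-row loop):

    #include <sqlite3.h>
    #include <stdint.h>

    /* Sketch of the second variation.  upd and ins are prepared once, e.g.:
         UPDATE Event SET Fld1=?1 WHERE Hash=?2 AND Fld0=?3 AND Fld2=?4 AND Fld3=?5
         INSERT INTO Event (Hash, Fld0, Fld1, Fld2, Fld3) VALUES (?1, ?2, ?3, ?4, ?5)
       If the UPDATE touches nothing, sqlite3_changes() reports 0 and we INSERT. */
    void upsert_event(sqlite3 *db, sqlite3_stmt *upd, sqlite3_stmt *ins,
                      uint32_t hash, int fld0, int fld1,
                      const char *fld2, const char *fld3)
    {
        sqlite3_bind_int  (upd, 1, fld1);
        sqlite3_bind_int64(upd, 2, (sqlite3_int64)hash);
        sqlite3_bind_int  (upd, 3, fld0);
        sqlite3_bind_text (upd, 4, fld2, -1, SQLITE_STATIC);
        sqlite3_bind_text (upd, 5, fld3, -1, SQLITE_STATIC);
        sqlite3_step(upd);
        sqlite3_reset(upd);

        if (sqlite3_changes(db) == 0) {       /* no existing row matched */
            sqlite3_bind_int64(ins, 1, (sqlite3_int64)hash);
            sqlite3_bind_int  (ins, 2, fld0);
            sqlite3_bind_int  (ins, 3, fld1);
            sqlite3_bind_text (ins, 4, fld2, -1, SQLITE_STATIC);
            sqlite3_bind_text (ins, 5, fld3, -1, SQLITE_STATIC);
            sqlite3_step(ins);
            sqlite3_reset(ins);
        }
    }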

EDIT:

At this point I have settled on the second variation, which is indeed slightly faster. It seems, however, that any kind of index slows down SQLite3 dramatically as the indexed table gets larger. Increasing the DB page size to 8192 bytes seems to help somewhat, but not nearly as drastically as I would like.
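
For completeness, the page-size change itself is a one-time setup step. As far as I can tell from the documentation, it only takes effect on an empty database or via a VACUUM, and cannot be changed once the file is in WAL mode, so it has to be set up front:

    #include <sqlite3.h>

    /* Set the page size early: the PRAGMA only takes effect on an empty database
       or after a VACUUM, and (per my reading of the docs) the page size cannot be
       changed at all once the file has entered WAL mode. */
    int set_page_size_8k(sqlite3 *db)
    {
        int rc = sqlite3_exec(db, "PRAGMA page_size = 8192;", 0, 0, 0);
        if (rc == SQLITE_OK)
            rc = sqlite3_exec(db, "VACUUM;", 0, 0, 0);  /* rebuilds with the new size */
        return rc;
    }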