
Identifying Differences Efficiently


There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.

One of the more efficient ways I've seen is to create hash values for the rows of data you already have, create hashes of the import once it's in the database, and then compare the existing hashes to the incoming hashes. Matching on the primary key gives four cases (see the comparison sketch after the list below):

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row
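As a minimal illustration, here is a Python sketch of that comparison, assuming each side has already been reduced to a mapping of primary key to row hash (the names `existing` and `incoming` are placeholders, not anything prescribed above):

```python
def classify_deltas(existing, incoming):
    """Compare {primary_key: row_hash} mappings and bucket the keys."""
    unchanged, updated, new = [], [], []
    for pk, new_hash in incoming.items():
        if pk not in existing:
            new.append(pk)            # key only in incoming -> new row
        elif existing[pk] == new_hash:
            unchanged.append(pk)      # key and hash match -> unchanged row
        else:
            updated.append(pk)        # key matches, hash differs -> updated row
    deleted = [pk for pk in existing if pk not in incoming]  # key only in existing
    return unchanged, updated, new, deleted
```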

How to hash varies by database product (for example, HASHBYTES in SQL Server, md5() in PostgreSQL, SHA2() in MySQL), but all of the major products have some built-in hashing function available.

The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.

It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.


The standard solution to this is hash functions. For each row, you calculate an identifier plus a hash of its contents. Then you compare hashes, and if the hashes are the same you assume the row is the same. This is imperfect - it is theoretically possible for different values to produce the same hash value - but in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from a hash function failing to work as promised.
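A rough sketch of that in Python, assuming each row arrives as a dictionary and that `id` is the primary key column (both assumptions for illustration); the only requirement is that the serialization is applied identically to both data sets:

```python
import hashlib

def row_hash(row, key_field="id"):
    """Return (identifier, hash) for a row, hashing every non-key field."""
    # Serialize fields in a fixed order so identical content always yields
    # an identical hash; the unit separator keeps adjacent fields from
    # blurring together ("ab" + "c" vs "a" + "bc").
    payload = "\x1f".join(
        f"{name}={row[name]}" for name in sorted(row) if name != key_field
    )
    return row[key_field], hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Something like `row_hash({"id": 7, "name": "Ada", "dept": "R&D"})` then yields the key `7` plus a stable digest of the remaining fields.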

Both rsync and git are examples of widely used software that use hashes in this way.

In general, calculating a hash before you put the data in the database is faster than performing a series of comparisons inside the database. It also allows the processing to be spread across multiple machines rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or outside it.
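As a sketch of that split, assuming the dump arrives as a CSV file (the file name below is hypothetical) and reusing the `row_hash` helper from the earlier sketch, the hashing can happen entirely outside the database and be sharded across machines by splitting the file:

```python
import csv

def hash_dump(path, key_field="id"):
    """Stream a CSV dump and build {primary_key: row_hash} without loading
    the full data set into the database first."""
    hashes = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):   # one dict per row, driven by the header
            pk, digest = row_hash(row, key_field)
            hashes[pk] = digest
    return hashes

# incoming = hash_dump("full_dump.csv")   # hypothetical file name
```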

There are many hash functions you can use. Depending on your application you might want a cryptographic hash, though you probably don't have to. More bits are better than fewer, but a 64-bit hash should be fine for the application you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
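The back-of-the-envelope arithmetic behind that figure, assuming each delta independently carries a 2^-64 chance of a false hash match:

```python
deltas = 10**12                  # a trillion processed deltas
p_mistake = deltas / 2**64       # expected number of undetected changes
print(p_mistake)                 # ~5.4e-08, i.e. roughly 1 in 18 million
```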