
Identifying Differences Efficiently


There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.

One of the more efficient ways I've seen is to create hash values for the rows of data you already have, create hashes of the import once it's in the database, and then compare the existing hashes to the incoming hashes. Matching on the primary key gives four cases (see the comparison sketch after the list below):

Primary key match + hash match = Unchanged row

Primary key match + hash mismatch = Updated row

Primary key in incoming data but missing from existing data set = New row

Primary key not in incoming data but in existing data set = Deleted row
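As a minimal illustration, here is a Python sketch of that comparison, assuming each side has already been reduced to a mapping of primary key to row hash (the names `existing` and `incoming` are placeholders, not anything prescribed above):

```python
def classify_deltas(existing, incoming):
    """Compare {primary_key: row_hash} mappings and bucket the keys."""
    unchanged, updated, new = [], [], []
    for pk, new_hash in incoming.items():
        if pk not in existing:
            new.append(pk)            # key only in incoming -> new row
        elif existing[pk] == new_hash:
            unchanged.append(pk)      # key and hash match -> unchanged row
        else:
            updated.append(pk)        # key matches, hash differs -> updated row
    deleted = [pk for pk in existing if pk not in incoming]  # key only in existing
    return unchanged, updated, new, deleted
```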

How to hash varies by database product (for example, HASHBYTES in SQL Server, md5() in PostgreSQL, SHA2() in MySQL), but all of the major products have some built-in hashing function available.

The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.

It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.


The standard solution to this is hash functions. For each row, you calculate an identifier plus a hash of its contents. Then you compare hashes, and if the hashes are the same you assume the row is the same. This is imperfect - it is theoretically possible for different values to produce the same hash value - but in practice you have more to worry about from cosmic rays causing random bit flips in your computer than from a hash function failing to work as promised.
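A rough sketch of that in Python, assuming each row arrives as a dictionary and that `id` is the primary key column (both assumptions for illustration); the only requirement is that the serialization is applied identically to both data sets:

```python
import hashlib

def row_hash(row, key_field="id"):
    """Return (identifier, hash) for a row, hashing every non-key field."""
    # Serialize fields in a fixed order so identical content always yields
    # an identical hash; the unit separator keeps adjacent fields from
    # blurring together ("ab" + "c" vs "a" + "bc").
    payload = "\x1f".join(
        f"{name}={row[name]}" for name in sorted(row) if name != key_field
    )
    return row[key_field], hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Something like `row_hash({"id": 7, "name": "Ada", "dept": "R&D"})` then yields the key `7` plus a stable digest of the remaining fields.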

Both rsync and git are examples of widely used software that use hashes in this way.

In general, calculating a hash before you put the data in the database is faster than performing a series of comparisons inside the database. It also allows the processing to be spread across multiple machines rather than bottlenecked in the database. And comparing hashes is less work than comparing many fields, whether you do it in the database or outside it.
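As a sketch of that split, assuming the dump arrives as a CSV file (the file name below is hypothetical) and reusing the `row_hash` helper from the earlier sketch, the hashing can happen entirely outside the database and be sharded across machines by splitting the file:

```python
import csv

def hash_dump(path, key_field="id"):
    """Stream a CSV dump and build {primary_key: row_hash} without loading
    the full data set into the database first."""
    hashes = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):   # one dict per row, driven by the header
            pk, digest = row_hash(row, key_field)
            hashes[pk] = digest
    return hashes

# incoming = hash_dump("full_dump.csv")   # hypothetical file name
```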

There are many hash functions you can use. Depending on your application you might want a cryptographic hash, though you probably don't have to. More bits are better than fewer, but a 64-bit hash should be fine for the application you describe. After processing a trillion deltas you would still have less than 1 chance in 10 million of having made an accidental mistake.
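The back-of-the-envelope arithmetic behind that figure, assuming each delta independently carries a 2^-64 chance of a false hash match:

```python
deltas = 10**12                  # a trillion processed deltas
p_mistake = deltas / 2**64       # expected number of undetected changes
print(p_mistake)                 # ~5.4e-08, i.e. roughly 1 in 18 million
```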