CHECKSUM() collisions in SQL Server 2005

I don't see where adding a checksum gets you anything with that level of collisions. Even one collision is too many, because it would cause you to join to the wrong data. If you can't guarantee you're joining to the correct record, improved performance is pointless when it comes at the cost of data integrity. This appears to be financial data, so you had better be very sure that your queries won't return bad results. You could actually end up debiting or crediting the wrong accounts if there are any collisions.

If you do go this route, Marc is right that you should pre-compute if at all possible (in my experience, adding a computation that has to happen for every record in a multimillion-record table is not likely to improve performance). If you can add the precomputed column (and you'll need triggers to keep it up to date), then you may not need to join on all six of the other columns to ensure there are no collisions, and then you might have improved performance. All you can do is test your theory. But be very sure you don't have any collisions.
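As a sketch of the precomputed approach (table and column names here are hypothetical placeholders for your six-column key): in SQL Server 2005 a PERSISTED computed column can store the checksum at write time and be indexed, which avoids both the per-query computation and the maintenance triggers.

```sql
-- Hypothetical table/column names; substitute your six key columns.
-- PERSISTED stores the value on insert/update, so no trigger is needed.
ALTER TABLE dbo.Transactions
    ADD KeyChecksum AS CHECKSUM(AccountNo, BranchCode, TxnDate,
                                TxnSeq, CurrencyCode, RefCode) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Transactions_KeyChecksum
    ON dbo.Transactions (KeyChecksum);

-- Queries seek on the narrow checksum index first, but still compare
-- the full key to guard against collisions:
SELECT t.*
FROM dbo.Transactions AS t
JOIN dbo.Staging AS s
    ON  t.KeyChecksum = CHECKSUM(s.AccountNo, s.BranchCode, s.TxnDate,
                                 s.TxnSeq, s.CurrencyCode, s.RefCode)
    AND t.AccountNo    = s.AccountNo
    AND t.BranchCode   = s.BranchCode
    AND t.TxnDate      = s.TxnDate
    AND t.TxnSeq       = s.TxnSeq
    AND t.CurrencyCode = s.CurrencyCode
    AND t.RefCode      = s.RefCode;
```

The checksum predicate narrows the candidate rows cheaply; the trailing six-column comparison is what actually guarantees correctness.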

Have you considered using a surrogate key with a unique index on the six natural-key fields instead? Then you could join on the surrogate key, which would likely improve performance a good bit. It can't be efficient to join on six columns (one of them a varchar) instead of one surrogate key. I realize that at this data size it's harder to refactor than in a non-production system, but it might be worth the downtime to permanently fix a persistent performance problem. Only you can say how complex a change this would be and how hard it would be to change all the stored procedures and queries to the better join. However, it might be feasible to try.
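A rough sketch of that refactor, with hypothetical names, assuming the existing six-column primary key constraint has already been dropped:

```sql
-- Hypothetical names throughout; adjust to your schema.
-- 1. Add a narrow surrogate key.
ALTER TABLE dbo.Transactions
    ADD TransactionID INT IDENTITY(1,1) NOT NULL;

ALTER TABLE dbo.Transactions
    ADD CONSTRAINT PK_Transactions PRIMARY KEY CLUSTERED (TransactionID);

-- 2. Keep enforcing uniqueness of the natural key.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Transactions_NaturalKey
    ON dbo.Transactions (AccountNo, BranchCode, TxnDate,
                         TxnSeq, CurrencyCode, RefCode);

-- 3. Child tables then carry TransactionID and join on that
--    single integer instead of six columns.
```

The unique index preserves the natural key's integrity guarantee while every join shrinks to a single 4-byte integer comparison.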

What I've seen a lot of folks glossing over thus far is that CHECKSUM has a ton of collisions, by Microsoft's own admission. It's even worse than MD5, which has its fair share of meaningful collisions.
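One easy-to-see class of CHECKSUM collision: for character arguments, CHECKSUM respects the column's collation, so under a typical case-insensitive collation, strings that differ only in case produce the same value.

```sql
-- On a server/database with a case-insensitive default collation,
-- these two distinct strings yield the same checksum.
SELECT CHECKSUM('Hello') AS ChecksumUpper,
       CHECKSUM('hello') AS ChecksumLower;
```

That's before you even get to the collisions between genuinely unrelated values that a 32-bit hash makes inevitable at multimillion-row scale.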

If you're looking to get a hash column, consider using HASHBYTES with SHA1 specified. SHA1 has far fewer meaningful collisions than MD5 or CHECKSUM. CHECKSUM should never be used to determine whether a row is unique; it's only a quick check on the fidelity of two values. With HASHBYTES, your collision rate should be effectively 0%, unless you have duplicate rows (which, this being a PK, should never happen).
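HASHBYTES takes a single (n)varchar or varbinary input rather than a column list, so the six key columns have to be concatenated first. A sketch, with hypothetical column names and conversions:

```sql
-- Hypothetical columns; the '|' delimiter prevents ('ab','c') and
-- ('a','bc') from concatenating to the same input string.
SELECT HASHBYTES('SHA1',
           CAST(AccountNo AS varchar(20))        + '|' +
           BranchCode                            + '|' +
           CONVERT(varchar(23), TxnDate, 121)    + '|' +
           CAST(TxnSeq AS varchar(10))           + '|' +
           CurrencyCode                          + '|' +
           RefCode) AS KeyHash
FROM dbo.Transactions;
```

Note the delimiter between fields: without it, two different combinations of values can concatenate to identical strings and hash identically, reintroducing exactly the collision problem you're trying to avoid.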

Keep in mind that HASHBYTES will truncate any input larger than 8000 bytes, but your PK (all columns concatenated) is well under that, so you shouldn't have any trouble.

If your checksum gets it down to 0.33% of the data, then I'd argue that it is working fine... especially if you use this column in combination with other (indexed) columns.

Of course, to be effective as an index you probably want to compute and store this value when inserting/updating data, with a non-clustered index.

Of course, a regular spanning index over the columns in question may do just as well or better...