
When to store precalculated values vs calculating them when retrieving them?


About your two questions:

I can tell new developers that these structures are not normalized, but they can say it's faster. How do I counter that? Do I counter that? Do others structure their databases like this?!

It may be faster, but not necessarily so: whenever you decide to add extra information to a table (the extra fields, in your case) you also add a performance penalty, because the table gets bigger. That may mean more data travelling from server to clients, or more data to be paged in and out of memory. Also, if a field is there to speed up queries, it will probably have one or more indexes on it, which again carry a performance penalty during updates and inserts.

The main point, though, is the one I hinted at in my comment: "cached" and "precomputed" values make the system more fragile in terms of data integrity. Are you sure that "event_creator_id" always points to the real creator, even after someone has amended the original value? If yes, this has costs too, both in terms of computation (you have to update all the tables whenever the creator changes) and in terms of actual development and testing effort (are you sure nobody forgot to propagate changes to the precomputed fields?).
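To make that fragility concrete, here is a minimal sketch (using SQLite, with made-up table and column names since the original schema isn't shown) of how a denormalized event_creator_id silently drifts out of sync as soon as the source row is amended:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        creator_id INTEGER NOT NULL          -- the authoritative value
    );
    CREATE TABLE registrations (
        id               INTEGER PRIMARY KEY,
        event_id         INTEGER NOT NULL REFERENCES events(id),
        event_creator_id INTEGER NOT NULL    -- denormalized copy of events.creator_id
    );
""")
conn.execute("INSERT INTO events (id, creator_id) VALUES (1, 100)")
conn.execute("INSERT INTO registrations (id, event_id, event_creator_id) VALUES (1, 1, 100)")

# Someone amends the original creator...
conn.execute("UPDATE events SET creator_id = 200 WHERE id = 1")

# ...and the cached copy is now wrong until every code path remembers to propagate the change.
row = conn.execute("""
    SELECT e.creator_id, r.event_creator_id
    FROM events e JOIN registrations r ON r.event_id = e.id
""").fetchone()
print(row)  # (200, 100) -- authoritative value and cached value disagree
```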

The same goes for aggregate values like "discounted price" or running totals... and changing the original data is probably much more frequent than changing the "event creator" information. Again, is there a proper "cache invalidation" mechanism in place to ensure that the total sales are recomputed whenever someone completes a sale? What about returned items? Has anyone considered the cost of ensuring integrity?

Running totals and other derived values should be implemented using views instead, so that caching, if any, is performed by the actual DBMS engine, which knows how to take care of it properly.
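As an illustration of that approach, here is a minimal sketch (again SQLite, with hypothetical table and column names): the total is exposed through a view, so it is recomputed from the source rows on every query and can never go stale, and a return is just another row rather than a separate invalidation step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        id       INTEGER PRIMARY KEY,
        customer TEXT    NOT NULL,
        amount   NUMERIC NOT NULL            -- negative rows can represent returns
    );

    -- The derived value lives in a view: it always agrees with the source rows.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total_sales
        FROM sales
        GROUP BY customer;
""")
conn.executemany("INSERT INTO sales (customer, amount) VALUES (?, ?)",
                 [("alice", 50), ("alice", 30), ("bob", 20)])
print(conn.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall())
# [('alice', 80), ('bob', 20)]

# A returned item is simply another row; no cache invalidation code is needed.
conn.execute("INSERT INTO sales (customer, amount) VALUES ('alice', -30)")
print(conn.execute("SELECT * FROM customer_totals ORDER BY customer").fetchall())
# [('alice', 50), ('bob', 20)]
```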

Is there a rule of thumb, or a set of principles, which I can use to say - 'oh, it will be slower, but only by 1%, so it's okay to do it this way', etc.?

A DB (or arguably any kind of computing system) should be "correct first", so that you can then figure out how to make it "fast enough, second". Trading correctness for speed is a decision you should not take when designing the DB, unless you already know that timeliness is considered far more important than correctness, i.e. your requirements clearly state that having possibly wrong or outdated information matters less than response time.

In other words: designing a table with redundant cached info is another example of premature optimization and should be avoided at all costs.

See also this - especially the answers


Every DB book I have read on relational design has included a section on "planned" redundancy or "limited" de-normalization. It depends on the environment. Wells Fargo pre-calculates bank statement totals and stores the pre-calculated values.

Imagine how long it would take to do those calculations if they waited until the end of each cycle, when they go to print the statements.
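A hedged sketch of what such planned redundancy can look like (hypothetical table and job names, not the actual bank's pipeline): the totals are recomputed by a single, scheduled batch step at cycle close and stored next to the statement, so the expensive aggregation is paid once rather than at print time, and consistency is managed in one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        account TEXT    NOT NULL,
        cycle   TEXT    NOT NULL,            -- e.g. '2024-05'
        amount  NUMERIC NOT NULL
    );
    CREATE TABLE statement_totals (
        account TEXT    NOT NULL,
        cycle   TEXT    NOT NULL,
        total   NUMERIC NOT NULL,
        PRIMARY KEY (account, cycle)
    );
""")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [("acct-1", "2024-05", 120.00),
                  ("acct-1", "2024-05", -45.50),
                  ("acct-2", "2024-05", 10.00)])

def close_cycle(conn, cycle):
    """Planned batch step: recompute and store statement totals for one cycle.

    Because the redundancy is written by this one job rather than scattered
    through application code, keeping it consistent stays tractable.
    """
    conn.execute("""
        INSERT OR REPLACE INTO statement_totals (account, cycle, total)
        SELECT account, cycle, SUM(amount)
        FROM transactions
        WHERE cycle = ?
        GROUP BY account, cycle
    """, (cycle,))

close_cycle(conn, "2024-05")
# Printing a statement now reads the stored total instead of re-aggregating.
print(conn.execute("SELECT * FROM statement_totals ORDER BY account").fetchall())
# [('acct-1', '2024-05', 74.5), ('acct-2', '2024-05', 10)]
```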

Planned redundancy is Normal!