Is it worth breaking out address information into a separate database table?


Normalizing addresses into a hierarchy is a questionable proposition. It really depends on what you mean to do with your address data.

The idea of normalizing to avoid update anomalies is a little dubious. How often do cities, states or countries actually change names? Furthermore, if this were to happen, how likely is it that the change would be wholesale (i.e. every instance of old name X changing to new name Y)? I can tell you that when there was a flurry of municipal amalgamations in Canada in the 2000s, what happened in practice was that boundaries were redrawn and lots of old names stuck around, just with smaller territories than before.

The fact is that things like municipality names can be loosely defined. For example, where I grew up, my address had three officially recognized municipality names according to the postal authority: WILLOWDALE, NORTH YORK, TORONTO - all of which were valid options, although one was "more official" than the others. The problem is that all of Willowdale is in North York, but North York also contains "Downsview" and others.

Other frequent arguments for normalizing addresses include ensuring proper spelling and providing a basis for territory management. Given the vagaries of address data quality, these arguments are not convincing.

The best way to ensure address data quality is to keep your addresses in a relatively flat, relatively simple structure and to employ one or more address quality tools that use postal authority data to match and standardize your addresses. Keep city, state and postal code in their own fields, by all means, but don't keep them in distinct tables. This is actually more flexible than a normalized structure while producing more reliable results overall.
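As a rough illustration, a flat structure along these lines is usually enough (the table and column names here are just placeholders, PostgreSQL syntax):

-- one row per address; every component is a plain column, no lookup tables
create table customer_address (
    address_id    serial primary key,
    address_line1 text not null,
    address_line2 text,
    city          text not null,
    state         text,
    postal_code   text,
    country       text not null
);

An address quality tool can then standardize city, state and postal_code in place, without any schema gymnastics.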

Similarly, territory management is best done at a more granular level than municipality. Some municipalities are enormous and names can be ambiguous. Instead use a postal code or ZIP+4 (depending on jurisdiction). This is much more granular and unambiguous. Again, an address data quality tool will ensure that you have proper postal coding on your addresses.
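For example, territory assignment can hang off a postal code prefix rather than a municipality name. Here is a minimal sketch, assuming the customer_address table above and a three-character prefix (the Canadian FSA); both the territory table and its contents are invented:

-- hypothetical mapping from postal code prefix to sales territory
create table territory (
    postal_prefix  text primary key,
    territory_name text not null
);

-- assign each address to a territory by its postal code prefix
select a.address_id, t.territory_name
from customer_address a
join territory t on t.postal_prefix = left(a.postal_code, 3);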


From my experience, yes.

1. The city, state and country are entities in the real world, so it is good to have them as entities in your database model. It keeps the names consistent, as the other answerers have already mentioned.

2. You may populate and validate them from external open sources or standards bodies, e.g. the international standard ISO 3166 for countries (see the sketch after this list).

3. In present or future versions of your app, you may even connect directly to external sources to maintain them.

4. If you ever go multi-lingual, you will already have all the names to translate in one place.

5. If you ever exchange or interface data with other parties or apps, you will need these common classifications.
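As a rough sketch of point 2, a country reference table seeded from ISO 3166-1 might look like this (the table name and the handful of sample rows are only illustrative; a real load would come from the published standard):

-- reference table keyed on the ISO 3166-1 alpha-2 code
create table country (
    iso_code char(2) primary key,
    name     text not null
);

insert into country (iso_code, name) values
    ('CA', 'Canada'),
    ('US', 'United States of America'),
    ('GB', 'United Kingdom of Great Britain and Northern Ireland');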


Before I get started, I want to point out that {city, state, country} isn't an address.

Should things like City, State or Country be normalized and broken out into their own tables, with the original table then having CityId and StateId columns? We were having a debate about whether this was a good or bad decision.

Normalization is good. I'm almost always advocating for normalization.

But using ID numbers instead of text has nothing to do with normalization. Substituting "CityId" for "City" and "StateId" for "State" has no effect on the normal form of the table. If it were in 3NF before that change, it will still be in 3NF after that change.
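To make that concrete, here are the two shapes being compared (all names are invented for the sketch). If the first table is in 3NF, so is the second; the ids change nothing about the dependencies:

-- city, state, country stored as text
create table person_with_text (
    person_id serial primary key,
    name      text,
    city      text,
    state     text,
    country   text
);

-- the same table with ids substituted for the text values
create table person_with_ids (
    person_id serial primary key,
    name      text,
    city_id   integer,   -- would point at a city lookup table
    state_id  integer    -- would point at a state lookup table
);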

You can increase data integrity with a foreign key reference. Data integrity is also good. But this, like many other database design decisions, doesn't have anything to do with normalization.

The simplest way to increase data integrity for cities would be to select distinct cities into a new table. (PostgreSQL syntax.)

select distinct city, state, country
into new_table
from person;

You need city, state, and country to represent the "full name" of a city. You also need a key.

alter table new_table
add primary key (city, state, country);

Now you can declare a foreign key constraint to guarantee that {city, state, country} will always reference one single row in that new table.

alter table Person
add constraint city_state_country_from_new_table
foreign key (city, state, country)
references new_table (city, state, country)
on update cascade;

I wouldn't worry about the performance of cascading updates for this kind of table. (Unless I were using Oracle; Oracle doesn't support cascading updates.) These kinds of names change rarely, and I know PostgreSQL can cascade updates to 3 million rows in a table of 50 million rows in less than 3 seconds on my desktop. My desktop isn't anything special, and it's running three database management systems and two web servers. If I had bigger tables and needed more time, I'd schedule the change during a maintenance window.

You can increase data integrity for states in the same way.

select distinct state, country
into another_new_table
from new_table;

etc., etc.

Having said all that, adding a surrogate key to new_table is a defensible design decision, but only if you spend some time thinking about it. (Not thinking is never defensible.)

The most immediate effect of replacing {city, state, country} with a surrogate key is that you now need a join in every query on a table that required no joins before. You can test the effect on performance with random sample data. Until you have many millions of rows, you'll probably find that the natural key is faster than a join on the surrogate key. That's what I found when I tested.
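If you do decide on a surrogate key after that testing, the shape of the change is roughly this (the new column names and the name column in the final query are assumptions, and the data migration to populate person.city_id is omitted). The last query is the join you now pay on every read:

-- give the lookup table a surrogate key
alter table new_table
add column city_id serial unique;

-- point person at the surrogate key instead of {city, state, country}
alter table person
add column city_id integer references new_table (city_id);

-- every query that used to read city, state, country directly now needs this join
select p.name, n.city, n.state, n.country
from person p
join new_table n on n.city_id = p.city_id;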