Data normalization and writing queries Data normalization and writing queries sql sql

Data normalization and writing queries


general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Only partly true.

Normalization is not about "redundancy".

It's about "update anomalies".

1NF is the "don't use arrays" rules. Breaking 1NF means a row isn't atomic, but a collection and independent updates in the collection wouldn't work out well. There'd be locking and slowness.

2NF is the "one key" rule. Each row has exactly one key and everything in the row depends on the key. There are no dependencies on part of the key. Some folks like to talk about candidate keys and natural keys and foreign keys; they may exist or they may not. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is violated, you've got columns which depend on part of a key, but not the whole key. If you had a table with (Part Number, Revision Number) as a key, and attributes of color and weight, where weight depends on the whole key, but color only depends on the part number. You have a 2NF problem where you could update some part colors but not others, creating data anomalies.

3NF is the "only the key" rule. If you put derived data in a row, and change the derived result, it doesn't match the source columns. If you change a source column without updating the derived value, you have a problem, too. Yes, triggers are a bad hackaround to allow 3NF design violations. That's not the point. The point is merely to define 3NF and show that it prevents an update problem.

each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization?

It is.


Now, as I understand it, the general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Ummm, ok.

In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, ie. a table has two or three columns and that's it.

The number of tables doesn't say anything about whether the design is good or bad. Some businesses need one or two. Others need more. I've worked on databases at Fortune 500s that had thousands of tables.

The number of columns doesn't say anything about whether the design is good or bad. And the number of columns has nothing to do with fragmentation. I will say that tables that have relatively few columns is generally a good sign. Not always a good sign, but generally a good sign.

Now, when it comes to writing sql queries, it has become something of a minor hassle since each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization? Or does this point to something else?

There are two different, common reasons for that.

When you normalize a table, you reduce redundancy (and increase data integrity) by identifying functional dependencies, isolating the functionally dependent columns in one or more new tables, and removing them from the original table. So normalizing a table, in the sense of moving from a lower normal form to a higher normal form

  • always increases the number oftables,
  • always reduces the number of columnsin the original table, and
  • sometimes requires a join to retrievedata for humans.

Another common practice is to replace strings with id numbers. This has nothing to do with normalization. (There's no such thing as "id number normal form".) Replacing strings with id numbers

  • always increases the number oftables,
  • doesn't change the number of columnsin the original table (unless done atthe same time as normalization),
  • always requires a join to retrieve data for humans.

There seems to be some confusion in other parts of this thread. I realize that, strictly speaking, none of the following is directly related to the OP's question.

1NF is the "one value" principle. It doesn't have anything to do with a row being "atomic". In the relational model, atomic doesn't refer to rows; it refers to values.

"One value" means that each intersection of a row and a column contains a single value. (In other words, the value is "atomic". But the word atomic has some unfortunate connotations, so most modern practitioners avoid it.) That value doesn't need to be simple; it can be arbitrarily complex. But if it has parts that themselves have meaning, the dbms either completely ignores those parts, or it provides functions to manipulate them. (You don't have to write functions to manipulate the parts.)

I think the simplest example is a date. Dates have parts, consisting of a year, month, and day. The dbms either ignores those parts (as in SELECT CURRENT_DATE) or it provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)).

Attempts to dodge the "one value" principle lead to a corollary: the "no repeating groups" principle.

A repeating group involves multiple values from one domain, all values having the same meaning. So a table like the following one is an example of one kind of repeating group. (There are other kinds.) Values for both "phone_1" and "phone_2" come from the same domain, and they have the same meaning--user 'n' has phone numbers (phone_1 and phone_2). (Primary key is "user_id".)

user_id    phone_1           phone_2    1          (111) 222-3333    (111) 222-33342          (111) 222-3335    (111) 222-3336

But the next table, although it's very similar, doesn't have a repeating group. The values come from the same domain, but they don't have the same meaning. (Primary key is "user_id".)

user_id    home_phone        work_phone    3          (111) 222-3333    (111) 222-33344          (111) 222-3335    (111) 222-3336

2NF is the "whole key" principle. It doesn't have anything to do with the number of keys; a table having 'n' columns could have 'n' keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you're doing normalization exercises), if you see the word key by itself, think "candidate key".

Instead, 2NF has to do with candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that every non-prime attribute be functionally dependent on all the columns of every candidate key, not on just some of the columns of any candidate key. (A non-prime attribute is an attribute that's not part of any candidate key.)

The following example is adapted from the Wikipedia entry on 2nf. (Primary key is {employee, skill}.)

Table: employee_skillsemployee        skill            current_work_location--Jones           Typing           114 Main StreetJones           Shorthand        114 Main StreetJones           Whittling        114 Main StreetBravo           Light Cleaning   73 Industrial WayEllis           Alchemy          73 Industrial WayEllis           Flying           73 Industrial WayHarrison        Light Cleaning   73 Industrial Way

While it's true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it's also functionally dependent on just part of the primary key, "employee". That table isn't in 2NF.

You can't dodge a 2NF problem by assigning each row a surrogate key. (Primary key is es_id; there's a UNIQUE constraint on the former primary key, {employee, skill}).

Table: employee_skillses_id   employee        skill            current_work_location--1       Jones           Typing           114 Main Street2       Jones           Shorthand        114 Main Street3       Jones           Whittling        114 Main Street4       Bravo           Light Cleaning   73 Industrial Way5       Ellis           Alchemy          73 Industrial Way6       Ellis           Flying           73 Industrial Way7       Harrison        Light Cleaning   73 Industrial Way

It should be obvious that adding the id number did nothing to remove the partial dependency employee->current_work_location. Without removing the partial dependency, this table is still not in 2NF.

3NF is the "no transitive dependencies" principle. It doesn't necessarily have anything to do with derived or calculated data, as you can tell from the Wikipedia example, adapted here. (Primary key is {tournament, year}. This table is not in 3NF.)

Table: tournament_winnerstournament             year  winner            winner_date_of_birth--Indiana Invitational   1998  Al Fredrickson    21 July 1975Cleveland Open         1999  Bob Albertson     28 September 1968Des Moines Masters     1999  Al Fredrickson    21 July 1975Indiana Invitational   1999  Chip Masterson    14 March 1977

Two dependencies show that this table has a transitive dependency.

  1. The values in winner_date_of_birthappear to be functionally dependent on theprimary key. Each primary key valuedetermines one and only one valuefor winner_date_of_birth. But . . .
  2. The values in winner_date_of_birthalso appear to be functionally dependent onwinner. Each value for winnerdetermines one and only one valuefor winner_date_of_birth.

Given those two apparent functional dependencies and an understanding of what the words tournament, winner, and date of birth mean, we can say that

  • winner -> winner_date_of_birth is afunctional dependency, and
  • {tournament, year} -> winner is a functional dependency, and
  • {tournament, year} ->winner_date_of_birth is a transitivedependency.


Database views are a critically important tool in this dilemma. This excellent introduction says:

Here's the good news: you don't have to work with the normalized tables! ... It is very easy (at least for DBAs) to create an abstraction layer of joined views on top of the normalized data tables, putting the base tables completely "behind the scenes", and out of sight.