Sunday, March 11, 2007

Data Normalization

A data element must depend on the key, the whole key and nothing but the key, so help me Codd. - Ray Kirk

I studied database normalization in the early 1990s when I was working in data modelling with Clive Finkelstein, the father of Information Engineering, and his disciples at Information Engineering Systems, Corp. I am surprised by the aura of mysticism that seems to surround data normalization. Most people seem to view data normalization as something that is:

  1. Highly complicated, requiring an advanced degree in mathematics to understand.
  2. Of no practical value, and not something you'd want to do to a real database.

Both are wrong.

Data normalization is nothing more than a restatement of the 'DRY' principle: every piece of data must have a single, unambiguous, authoritative representation within the database.

Data normalization captures the business rules and encodes them in the data model. Many people believe that normalizing data causes performance problems. It is debatable whether this was ever the case; the reported performance improvements were usually gained by ignoring the business rules. But even if it was true in the past, it is not now with modern databases.

So how do you capture the business rules in the data model?

Each discrete piece of data should be in its own field. Do not encode many pieces of data in a single field. Each field can be independently queried; if many pieces of data are encoded in a single field, they cannot be independently queried. Just because you do not need to query them independently now does not mean you will not need to at some time in the future, so do not store compound data in a single field.
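For instance, here is a minimal SQL sketch (the table and column names are just illustrative, not from any real schema):

    -- Compound data in one field is hard to query on its parts:
    -- CREATE TABLE customer (id INTEGER PRIMARY KEY, full_address VARCHAR(200));

    -- Each discrete piece of data in its own column:
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        name   VARCHAR(100) NOT NULL,
        street VARCHAR(100),
        city   VARCHAR(50),
        state  CHAR(2),
        zip    VARCHAR(10)
    );

    -- Now each piece can be queried on its own, e.g. all customers in one state:
    SELECT id, name, city FROM customer WHERE state = 'NY';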

Each discrete piece of data should be recorded in exactly one place in the database (except for foreign keys). If you store the same data in multiple places, you will need constraints and triggers to guarantee that every copy always stays consistent; if you do not, the copies may drift apart, and since you do not know which copy is authoritative, you will not know what the value really is. The studies that showed data normalization hurting performance ignored this.
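Continuing the same hypothetical schema, a sketch of what this looks like: the customer's name is stored once, and an orders table carries only a foreign key:

    -- Unnormalized: the customer's name copied onto every order.
    -- CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_name VARCHAR(100), order_date DATE);

    -- Normalized: orders reference the customer by key; the name lives in one place.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (id),
        order_date  DATE
    );

    -- A join recovers the name whenever it is needed:
    SELECT o.order_id, c.name
      FROM orders o
      JOIN customer c ON c.id = o.customer_id;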

Eliminate repeating data. If your customer table has columns for home phone number, work phone number, cell phone number, voice mail number, pager number, and so on, this may represent a problem: many of your customers may not have all the phone numbers in your table, and some may have additional phone numbers that are not in your table. You should consider creating a new table to hold the phone numbers. Its primary key should be a foreign key to the customer table plus a description column containing "home", "work", "cell", etc. You can allow the description column to be free form, or you can create a type table with a single column holding all the legal values for the description and make the description column in the phone number table a foreign key to this type table.
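A sketch of what that might look like, again with hypothetical names and reusing the customer table from the earlier sketches:

    -- Optional type table restricting the legal phone descriptions:
    CREATE TABLE phone_type (
        description VARCHAR(20) PRIMARY KEY   -- 'home', 'work', 'cell', ...
    );

    -- One row per phone number, instead of one column per kind of number:
    CREATE TABLE customer_phone (
        customer_id  INTEGER     NOT NULL REFERENCES customer (id),
        description  VARCHAR(20) NOT NULL REFERENCES phone_type (description),
        phone_number VARCHAR(20) NOT NULL,
        PRIMARY KEY (customer_id, description)
    );

A customer can now have any number of phone numbers, and adding a new kind of number means adding a row to phone_type, not a column to the customer table.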

Eliminate inapplicable data. Codd made a distinction between null values that are unknown and null values that are inapplicable. An example of an inapplicable null would be a SEX column in my customer table, recording whether the customer is male or female. What value should this field have in the case of a corporate customer? The correct answer is to create a secondary table with the same primary key as the customer table, containing the sex column and holding rows only for individual customers; corporate customers simply have no data in this secondary table.
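One way this might be sketched, assuming the same hypothetical customer table:

    -- Only individual customers have a row here; corporate customers simply
    -- have no row, so there is never an inapplicable SEX value to store.
    CREATE TABLE individual_customer (
        customer_id INTEGER PRIMARY KEY REFERENCES customer (id),
        sex         CHAR(1) NOT NULL CHECK (sex IN ('M', 'F'))
    );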

This is not all there is to say about data normalization; entire books have been written on the subject. But this should be enough to solve 80% of the data normalization problems you are likely to face. To learn more, read Codd, Date, and Finkelstein.

Next time: surrogate keys.


2 Comments:

Blogger Syarzhuk said...

Data normalization is very very very very useful for transactional systems with a lot of inserts/updates/deletes. However, normalization usually creates way more tables than people initially think, and it can sometimes be a pain to join the tables back together. Data warehousing is a field where denormalization is common. My favorite example of denormalization is a typical (street, city, state, zip) combination - since city and state could be derived from the zip code, this is not in third normal form!

Saturday, July 21, 2007 at 4:03:00 PM UTC  
Blogger Ed said...

The business rules drive the normalization. If there is a rule that requires that the zip code determine the city and state, then that dependency should be normalized out, but if there is no such rule there is no such normalization.

Sunday, July 22, 2007 at 12:20:00 AM UTC  
