Monday, March 12, 2007

Surrogate Keys

I may be putting my foot in it with this post. I believe wars have been fought over whether or not to use surrogate keys. There are those who will tell you that the use of surrogate keys is proof that the designer of the database did not know anything about database design. There are others who always use surrogate keys and claim anyone who designs a database with natural keys is asking for trouble. My position is, as it is so often, that both are right.

Before getting to the debate I will need to define a few terms. You probably already know most of them, but I need be sure what I say is understood, so I need to make it clear what I mean by the terms I am using.

Candidate Key
A set of one or more columns that can be used to uniquely identify a record.
Primary Key
The set of one or more columns is used internally to uniquely identify a record.
Natural Key
A primary key composed of one or more columns of data which are used by the users in the natural course of their work.
Surrogate Key
A single column, created by the data modeller solely for the purpose of uniquely identifying a record. A Surrogate Key is not derived from application data. The value of a surrogate key contains no semantic meaning.
Foreign Key
A copy of the primary key of one table, put in another table to reference that table.
Logical Data Model
the data model as seen by the user.
Physical Data Model
the data model as implemented in the DBMS.

The first step is to identify candidate keys. Candidate keys must be guaranteed unique, not coincidentally unique in the current data set. Selecting candidate keys is a business decision, not a technical decision. A database for a small club may use first name as a primary key. The club may have a policy that any new member with the same first name as an existing member will be assigned a nickname which will be unique. This is a perfectly valid business rule. Do not tell the client how they should be doing business, just make sure you understand how they do business, and that you make them aware of potential problems.

Every table MUST have one or more natural candidate keys. If you cannot identify a candidate key for a table, you have a problem in your datamodel. Do not generate a surrogate key because you can not find a natural candidate key. This is a common miss use of a surrogate key, and is, I suspect, the main reason that surrogate keys have gotten a bad name. The structure of the database must reflect the business rules. If there is no natural candidate key for a table that is saying that there are records in that table that the business dose not recognize as being different, in otherwords the table contains duplicate data. You do not want duplicate data in the database.

After you have identified, and documented, one or more candidate keys for each table, it is necessary to select a single primary key for each table. A natural key would be one of the candidate keys that have already been identified, but not all candidate keys make good primary keys.

So what is wrong with Natural keys? Imagine a small business that we are designing a customer database for. This business has a policy against assigning customers a customer number, they consider it to be dehumanizing. So they identify customers by name. There is a small problem, names are not unique. The client has a solution to this problem. Instead of using First, Middle and Last name alone as the key they will add date of first order, city, state, and street address. Now there can only be a problem if two people with the same name, living at the same address, place their first order on the same day. An unlikely event which would probably cause lots of confusion no matter how the database is designed. So we run with this as a candidate key.

Would (First_Name, Middle_Name, Last_Name, Date_of_first_order, City, State, Street_address) be a good primary key? The primary key will be used as a foreign key. The customer table will probably have lots of other tables referencing it via foreign keys. Every table that references this table will need to add seven columns to hold this foreign key. Many of these tables will probably have fewer than seven columns of data so more space will be taken up by foreign keys than by data. With disk prices dropping this may not be a big problem. but it is something you might want to think about.

There is a bigger problem. What will happen if a customer gets married and changes her name? All the foreign keys into the customer table must be changed. According to Codd, this is not a problem. Codd said that any relational database management system must provide a command to change a primary key and all associated foreign keys. Unfortunately, there is no system available that fits Codd's definition of a relational database management system, and we will need to use an existing system to implement our database. So this will be a problem. We have the same problem if someone moves.

This case is screaming for a surrogate key. In fact it is contrived, most businesses would assign a customer number. It could be argued that customernumber is a surrogate key, except that the customer number has meaning at the business level, a surrogate key does not. The important things to realize are that; A candidate key has been identified, but for technical reasons it is not well suited to be used as a primary key. Some database designers claim that no natural key is ever well suited to be a primary keybecause they are entered by users and therefor subject to data entry errors and therefor may need to be changed to correct these errors. They have a very good point.

What is wrong with surrogate keys? The biggest problem is when they are used to cover up problems in the data model by creating a key for tables that otherwise have no candidate keys. This is a big no-no. All tables must have well documented, natural, candidate keys. But there is another problem, admittedly a small one, but a problem none the less, and that is that natural keys are more natural. If the client wants to query the credit history of acustomer the natural key is the natural data to query on. If the database uses a surrogate key then it will be necessary to either do a join or a sub query to the customer data table to get the surrogate key. This can make queries significantly more complex if all tables use surrogate keys.

There are two definitions I gave at the top of this post that I have not used yet. logical data model and physical data model. IMHO these two terms are the solution to the problem of surrogate keys. The logical data model is the model you design to. The physical data model is an implementation detail that should be encapsulated at the lowest possible level of abstraction, and definitely should never show up in the business logic. You should consider implementing the logical model as views and only access the data through these views, this way you can encapsulate the physical data model inside the RDBMS.

Like I said, both sides of the surrogate key debate are right. There is no place for surrogate keys in the logical data model, and there should be no user generated primary keys in the physical data model. This give you the best of both worlds.

Labels: ,

Sunday, March 11, 2007

Data Normalization

A data element must depend on the key, the whole key and nothing but the key, so help me Codd. - Ray Kirk

I studied database normalization in the early 1990's when I was working in data modelling with Clive Finkelstein, the father of Information Engineering, and his disciples at Information Engineering Systems, Corp. I am surprised by the aura of mysticism that seems to surrounds data normalization. Most people seem to view data normalization as something that is :

  1. highly complicated that requires an advanced degree in mathematics to understand.
  2. Of no practical value, and not something you'd want to do to a real database.
Both are wrong.

Data normalization is nothing more than a restatement of the 'DRY' principal; every piece of data must have a single, unambiguous, authoritative representation within the database.

Data normalization captures the business rules and encodes them in the data model. Many people believe that normalizing data causes performances problems. It is debatable if this ever was the case, the reported performance improvements were usually gained by ignoring the business rules. But even if it was in the past, it is not now with modern databases.

So how do you capture the business rule in the data model?

Each discrete piece of data should be in its own field. Do not encode many pieces of data in a single field. Each field can be independently queried. If many pieces of data are encoded in a single field they can not be independently queried. Just because you do not need to query them independently now, does not mean you will not need to at some time in the future, so do not store compound data in a single field.

Each discrete piece of data should be recorded in exactly one place in the database (except for foreign keys). If you store the same data in multiple places you will need to put in constraints and triggers to guarantee that the data always remains constant for all copies, if you do not do this they may become inconsistent and since you do not know which copy is authoritative, you will not know what the value really is. The studies that showed data normalization hurt performance ignored this.

Eliminate repeating data. If your customer data table has columns for home phone number, work phone number, cell phone number, voice mail phone number, pager number, etc. This may represent a problem, many of your customers may not have all the phone numbers in your table, and some may have additional phone numbers that are not in your table. You should consider creating a new table to hold the phone numbers. The primary key should be a foreign key to the customer table and a description column which will contain "home", "work", "cell", etc. You can allow the description column to be free form, or you can create a type table which will have a single column containing all the legal values for the description field and making the description in the phone number table a foreign key to this type table.

Eliminate inapplicable data. Codd made a distinction between null values that are unknown and null values that are inapplicable. An example of an inapplicable null value would be if my customer data table has a column for SEX, to record whether the customer is male or female. What value should this field have in the case of a corporate customer? The correct answer is to create a secondary table, with the same primary key as the customer table, which only exists for individual customers, which will contain the sex column, and no corporate customers will have any data in this secondary table.

This is not all there is to be said about data normalization, many entire books have been written on the subject. But this should be enough to solve 80% of the data normalization problems you are likely to face. To learn more read Codd, Date, and Finkelstein.

Next time: surrogate keys.

Labels: ,