Monday, March 12, 2007

Surrogate Keys

I may be putting my foot in it with this post. I believe wars have been fought over whether or not to use surrogate keys. There are those who will tell you that the use of surrogate keys is proof that the designer of the database did not know anything about database design. There are others who always use surrogate keys and claim anyone who designs a database with natural keys is asking for trouble. My position is, as it is so often, that both are right.

Before getting to the debate I will need to define a few terms. You probably already know most of them, but I need be sure what I say is understood, so I need to make it clear what I mean by the terms I am using.

Candidate Key
A set of one or more columns that can be used to uniquely identify a record.
Primary Key
The set of one or more columns is used internally to uniquely identify a record.
Natural Key
A primary key composed of one or more columns of data which are used by the users in the natural course of their work.
Surrogate Key
A single column, created by the data modeller solely for the purpose of uniquely identifying a record. A Surrogate Key is not derived from application data. The value of a surrogate key contains no semantic meaning.
Foreign Key
A copy of the primary key of one table, put in another table to reference that table.
Logical Data Model
the data model as seen by the user.
Physical Data Model
the data model as implemented in the DBMS.

The first step is to identify candidate keys. Candidate keys must be guaranteed unique, not coincidentally unique in the current data set. Selecting candidate keys is a business decision, not a technical decision. A database for a small club may use first name as a primary key. The club may have a policy that any new member with the same first name as an existing member will be assigned a nickname which will be unique. This is a perfectly valid business rule. Do not tell the client how they should be doing business, just make sure you understand how they do business, and that you make them aware of potential problems.

Every table MUST have one or more natural candidate keys. If you cannot identify a candidate key for a table, you have a problem in your datamodel. Do not generate a surrogate key because you can not find a natural candidate key. This is a common miss use of a surrogate key, and is, I suspect, the main reason that surrogate keys have gotten a bad name. The structure of the database must reflect the business rules. If there is no natural candidate key for a table that is saying that there are records in that table that the business dose not recognize as being different, in otherwords the table contains duplicate data. You do not want duplicate data in the database.

After you have identified, and documented, one or more candidate keys for each table, it is necessary to select a single primary key for each table. A natural key would be one of the candidate keys that have already been identified, but not all candidate keys make good primary keys.

So what is wrong with Natural keys? Imagine a small business that we are designing a customer database for. This business has a policy against assigning customers a customer number, they consider it to be dehumanizing. So they identify customers by name. There is a small problem, names are not unique. The client has a solution to this problem. Instead of using First, Middle and Last name alone as the key they will add date of first order, city, state, and street address. Now there can only be a problem if two people with the same name, living at the same address, place their first order on the same day. An unlikely event which would probably cause lots of confusion no matter how the database is designed. So we run with this as a candidate key.

Would (First_Name, Middle_Name, Last_Name, Date_of_first_order, City, State, Street_address) be a good primary key? The primary key will be used as a foreign key. The customer table will probably have lots of other tables referencing it via foreign keys. Every table that references this table will need to add seven columns to hold this foreign key. Many of these tables will probably have fewer than seven columns of data so more space will be taken up by foreign keys than by data. With disk prices dropping this may not be a big problem. but it is something you might want to think about.

There is a bigger problem. What will happen if a customer gets married and changes her name? All the foreign keys into the customer table must be changed. According to Codd, this is not a problem. Codd said that any relational database management system must provide a command to change a primary key and all associated foreign keys. Unfortunately, there is no system available that fits Codd's definition of a relational database management system, and we will need to use an existing system to implement our database. So this will be a problem. We have the same problem if someone moves.

This case is screaming for a surrogate key. In fact it is contrived, most businesses would assign a customer number. It could be argued that customernumber is a surrogate key, except that the customer number has meaning at the business level, a surrogate key does not. The important things to realize are that; A candidate key has been identified, but for technical reasons it is not well suited to be used as a primary key. Some database designers claim that no natural key is ever well suited to be a primary keybecause they are entered by users and therefor subject to data entry errors and therefor may need to be changed to correct these errors. They have a very good point.

What is wrong with surrogate keys? The biggest problem is when they are used to cover up problems in the data model by creating a key for tables that otherwise have no candidate keys. This is a big no-no. All tables must have well documented, natural, candidate keys. But there is another problem, admittedly a small one, but a problem none the less, and that is that natural keys are more natural. If the client wants to query the credit history of acustomer the natural key is the natural data to query on. If the database uses a surrogate key then it will be necessary to either do a join or a sub query to the customer data table to get the surrogate key. This can make queries significantly more complex if all tables use surrogate keys.

There are two definitions I gave at the top of this post that I have not used yet. logical data model and physical data model. IMHO these two terms are the solution to the problem of surrogate keys. The logical data model is the model you design to. The physical data model is an implementation detail that should be encapsulated at the lowest possible level of abstraction, and definitely should never show up in the business logic. You should consider implementing the logical model as views and only access the data through these views, this way you can encapsulate the physical data model inside the RDBMS.

Like I said, both sides of the surrogate key debate are right. There is no place for surrogate keys in the logical data model, and there should be no user generated primary keys in the physical data model. This give you the best of both worlds.

Labels: ,

2 Comments:

Blogger Syarzhuk said...

I'm a big proponent of surrogate keys. Not only natural keys can change their value (person changes name), but what's a natural key in one database can not be in another. Say, it might make sense for Social Security administration to use SSN as the natural key in their database, and for Department of Motor Vehicles to use the driver's license number. Now you start aggregating the data from their databases (let's say you're wasting taxpayers money pretending that siffling through databases will help us fight terrorism). Now some people don't have SSN but do have a driver's license, and some do have an SSN but don't drive. More, some don't have either. You'll have no choice but use a surrogate key.

Saturday, July 21, 2007 at 3:27:00 PM UTC  
Blogger Ed said...

But you still need a natural key in the new database, because the people using the database need some way to know if they are dealing with the same person. You're right, it can not be driver's license, and it can not be SSN. You need to talk to the users to figure out what it is.

Sunday, July 22, 2007 at 12:04:00 AM UTC  

Post a Comment

<< Home