In my long career in IT I have designed and built many applications using different languages, different paradigms and different storage systems. Among these storage systems have been flat files, indexed files, hierarchical databases, network databases and relational databases. The vast majority of my work over the past 20 years has been with relational databases, which are accessed using the ubiquitous SQL language. Although the technology is constantly evolving, with new features and new ways of doing things, some concepts have stood the test of time, and attempts by newcomers to supplant them with new and "cool" ideas often have unintended consequences further down the line. Whenever I see a database designed by someone who has yet to realise that these "cool" ideas are not as cool as he would like to think, I give a deep sigh and say "Will these novices never learn?" (I tell a lie. What I actually think is "WTF! What idiot designed this pile of cr*p?", but I'm too polite to say so.) I am therefore writing this article in the hope that my experiences will provide some guidance to those who are new to the world of database design.
There are some people out there who can design but never build. I have come across quite a few who cannot build what they design, and sometimes they design things which cannot actually be built without immense effort, so excuse me if I don't place any value on their opinions. It is only by building what you design that you actually experience the consequences of every design decision. It is only by using databases, by writing the queries to read and store data, that you get to understand the pros and cons of the different ideas that are floating around. I have written code which accesses databases designed the "old fashioned" way, and I have also written code which accesses databases designed with some of the "new fangled" ideas. This means that I have personally witnessed the consequences of each change from the "old" way to the "new" way, which in turn means that I am speaking from direct practical experience and not quoting some rumour, gossip, hearsay or old wives' tale.
Let me start by providing definitions of the terms used in this article's title:
The big problem here is when a novice receives advice he innocently assumes that the speaker actually knows what he is talking about. Unfortunately the world is full of charlatans, cowboys and purveyors of snake oil, so the poor novice is unlikely to spot when he is being directed up the wrong path. Searching the internet is just as bad as there is nothing to stop these eejits from spreading their false wisdom. If you wish to be directed down the path of enlightenment then read on.
These are the basic database rules that have existed for decades. Some of these rules may have been relaxed in recent DBMS software, but in my humble opinion some of these changes are not for the better and should be kicked into the long grass.
Secondary to the basic rules are some standard practices which have stood the test of time:
It was pointed out in the comp.databases.mysql newsgroup that SELECT * FROM customers (plural) would seem to be more logical than SELECT * FROM customer, but when you reference a column within a table the opposite is true: customer.customer_id would seem to be more logical than customers.customer_id. It is not possible to please everybody all of the time, so I tend to stick with the old convention which used the terms "customer file" and "product file" instead of "customers file" and "products file".
Some useful conventions I have followed for years are as follows:
While such conventions as these are not absolutely necessary, they do make it easier for a programmer to look at a column name and have a better understanding of what data it contains.
SELECT tableA.id, tableA.name, tableB.comment
  FROM tableA
  LEFT JOIN tableB ON (tableB.id = tableA.id)

can be reduced to:

SELECT tableA.id, tableA.name, tableB.comment
  FROM tableA
  LEFT JOIN tableB USING (id)

As you can see, this only works if the two ID columns share the same value as well as the same name.
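The equivalence of the two spellings is easy to check for yourself. Here is a minimal sketch using Python's built-in sqlite3 module (the tableA/tableB names and their contents are hypothetical):

```python
import sqlite3

# Two hypothetical tables sharing an 'id' column with the same values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, comment TEXT);
    INSERT INTO tableA VALUES (1, 'first');
    INSERT INTO tableB VALUES (1, 'a note');
""")

on_form = conn.execute(
    "SELECT tableA.id, tableA.name, tableB.comment "
    "FROM tableA LEFT JOIN tableB ON (tableB.id = tableA.id)").fetchall()
using_form = conn.execute(
    "SELECT id, name, comment "
    "FROM tableA LEFT JOIN tableB USING (id)").fetchall()
print(on_form == using_form)   # True - the two spellings are equivalent
```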
Naming Exceptions: As with almost every rule there may be exceptions in certain circumstances. The same content can have different names:
Here are some changes or "improvements" which have been introduced into some modern databases, but which should not necessarily be adopted as a "good idea":
As well as identifying some rules which do exist, here are some rules which do not exist in modern DBMS software:
Some novices quote the observation "numeric keys are faster than string keys" in order to justify their use of single integer keys where a viable string key or compound key already exists. While this observation may have had some merit decades ago when the hardware was considerably slower and the software less sophisticated, it has much less merit today. Quoting an out of date observation which is not substantiated with sufficient proof in the form of proper benchmarks could lead to wasted effort. If the effort of introducing an artificial integer key is greater than the speed improvement gained by using an integer key instead of a string key, then the net improvement is actually negative and so is not an improvement after all. This is especially true when you have to perform a lookup on a natural string key in order to obtain the integer primary key.
Here are some of the mistakes which I regularly encounter in databases which were designed by those I consider to be nincompoops:
Before moving from the draft stage of your database design to the final physical implementation it is absolutely essential that you go through a process called 'data normalisation'. This ensures that you have the right data in the right place and makes the insertion and subsequent retrieval of data as efficient and effective as possible. Without it your application could run like a pig with a wooden leg. This process is explained in more detail in The Relational Data Model, Normalisation and effective Database Design.
Having a column called 'id' on every table breaks the "different content should have different names" rule. If you ever bother to look at the sample SQL queries in the database vendor's documentation you will always see column names such as 'product_id' and 'customer_id' and never just 'id'. Why is this, I wonder? I once asked a novice why he followed this dubious practice, and his answer was: "It avoids duplication when typing queries. Why use 'product.product_id' when you can type less with 'product.id'?" A novice may think that this is a good idea, but a ninja knows that this "saving" comes with several gotchas:
Gotcha #1: Multi-table JOINS
What this novice failed to realise was that when the database returns a result set none of the column names is qualified with a table name. This may not matter too much with a single-table query, but what happens with a multi-table JOIN? Take a look at the following query:
SELECT *
  FROM order AS o
  LEFT JOIN deal AS d ON (o.deal_id = d.id)
  LEFT JOIN user AS u ON (d.user_id = u.id)
  LEFT JOIN address_book AS ab ON (o.address_id = ab.id)
  LEFT JOIN product_supplier AS ps ON (d.prod_supplier_id = ps.id)
  LEFT JOIN product_price AS pp ON (pp.product_supplier_id = ps.id)
  LEFT JOIN product AS p ON (ps.product_id = p.id)
  LEFT JOIN supplier AS s ON (ps.supplier_id = s.id)
 WHERE o.status = 'A'
This query references eight tables, which means that the result will contain eight references to a column called 'id'. This is bad enough when you look at the result in your SQL client program, as you will find it impossible to relate each occurrence of 'id' back to its table of origin. It is worse when the result is made available inside your program: all column names in the result set must be unique, so there will only be one value for each column name. Each successive value for the 'id' column will overwrite the previous value, which means that the value for 'id' in the result set will be the one from the last table in the query. So what happens when you want the 'id' value for a specific table which may or may not be the last in the query? The only solution is to specify an alias in the select list, such as:
SELECT *, o.id AS order_id, d.id AS deal_id, u.id AS user_id, ab.id AS address_id, p.id AS product_id, ...
The observant among you will see that the extra effort required to get around the problem caused by using the universal name of 'id' is greater than the savings which were assumed in the first place. The saving of *NOT* including the table name in the primary key name is more than eaten up when you have to modify your queries to include alias names which *DO* include the table name. So what you gain on the swings you lose twice over on the roundabouts. Although a human may be able to cope with the extra effort required to get around this "saving", with a proper framework the number of hand coded queries should be small to non-existent, but a computerised query generator might find it more difficult to deal with such violations of proper naming conventions.
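This overwriting behaviour is not peculiar to MySQL. A minimal sketch using Python's sqlite3 module (the orders/deal tables and their values are hypothetical) shows what happens when a client converts a joined row into a name-keyed structure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, deal_id INTEGER);
    CREATE TABLE deal   (id INTEGER PRIMARY KEY, descr TEXT);
    INSERT INTO orders VALUES (100, 7);
    INSERT INTO deal   VALUES (7, 'bulk discount');
""")
cur = conn.execute(
    "SELECT * FROM orders LEFT JOIN deal ON (deal.id = orders.deal_id)")
names = [d[0] for d in cur.description]   # ['id', 'deal_id', 'id', 'descr']
row = dict(zip(names, cur.fetchone()))    # duplicate keys: the last one wins
print(row["id"])   # 7 - the deal's id has silently overwritten the order's id (100)
```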
If you think the previous idea is clever, there are those who double the cleverness by using a universal 'name' column to go with the 'id' column. This means that the SELECT string has to be enhanced even more to include the following:
u.name AS user_name, p.name AS product_name, ...
So as well as having to specify aliases for every 'id' column you now have to double the effort by including aliases for every 'name' column. Double the savings! Double the fun!
Gotcha #2: Passing context to child programs
As well as having to modify queries to deal with multiple columns which have the same name, a ninja will spot a serious problem when trying to use a framework which passes context (selection criteria) from one program to another. Suppose the database has two tables called CUSTOMER and ORDER where the ORDER table has a foreign key which links back to an entry on the CUSTOMER table. The application will therefore have two programs called "List Customer" and "List Order". Without any selection criteria the "List Order" program will retrieve all orders in the system irrespective of the customer. Now suppose in the "List Customer" program there is the ability to select a customer then press a button to go to the "List Order" program to show only those orders which are associated with that customer. The "List Customer" program has to pass the relevant selection criteria to the "List Order" program so that it can retrieve the desired records. Because these two tables are related the primary key on the CUSTOMER table is also a foreign key on the ORDER table, so the context (selection criteria) which is passed from the "List Customer" program to the "List Order" program is the primary key of the selected customer record. This context can be passed as a simple string in the format
This context can be passed as a simple string in the format <pkey_name>='<pkey_value>' such as customer_id='12345'. Provided that the column name(s) in the primary key are *exactly* the same as the column name(s) in the foreign key, the receiving program can use the passed context without the need for any additional processing.
This entire process becomes more complicated if the column names are not *exactly* the same, such as when every primary key column of every table has the same name. The 'id' column in the CUSTOMER table cannot be used as the 'id' column in the ORDER table, as one contains the customer identity while the other contains the order identity. The child program which receives context from a parent program now needs additional code to translate the column name(s), so in this example 'id' has to be translated into 'customer_id'. This translation usually means extra work for the programmer, as it could not be automated unless the receiving program knew how each passed column name maps onto one of its own columns. This information is not usually available, so the process cannot easily be automated. It is usually the case that the effort required to implement the manual solution is much greater than the "savings" that the original design decision was supposed to generate, so that decision does not look so rosy in the cold light of day, now does it?
The novice designers like to justify their use of the universal column names 'id' and 'name' by saying that it makes it possible to use standard method names such as getById($id) and getByName($name) in their code. A ninja will laugh at this argument as he knows that in large systems the ability to retrieve records where the selection criteria is limited to either id='x' or name='x' is so restrictive that it will soon cause the system to come to a grinding halt. In the real world the variety of selection criteria is limitless, and as it would be impossible to generate a different method to deal with each possibility, a ninja uses a single getData($where) method in which the $where argument is a string which is acceptable as the WHERE clause in an SQL SELECT statement. This single method will take *ANY* valid string value and can therefore deal with *ANY* selection criteria.
If there is already a suitable primary key then you will be wasting both time and disk space by creating an additional technical key. This usually follows on from using the universal 'id' column. Take a look at the following:
CREATE TABLE country (
    id INT(11) NOT NULL AUTO_INCREMENT,
    iso_country_code CHAR(3) NOT NULL,
    country_name VARCHAR(50) NOT NULL,
    PRIMARY KEY (id),
    INDEX iso_country_code (iso_country_code)
);
The column 'iso_country_code' is guaranteed to always be unique, so that should be the primary key. The addition of a separate 'id' column just wastes space in the data table and in its own index. Why force every lookup through this table merely to convert a 4-byte integer into a 3-byte string? What is even worse is that the 'iso_country_code' column is defined as a non-unique index instead of a unique index (candidate key), which means that it is possible for the same country code to appear more than once.
Similar mistakes can be made with relationships between tables:
Take for example a table called ORDER and another called ORDER_ITEM where there can be many rows of ORDER_ITEM for each row of ORDER. Assuming that the primary key of the ORDER table is called 'order_id' it is possible to define the ORDER_ITEM table as follows:
CREATE TABLE order_item (
    id INT(11) NOT NULL AUTO_INCREMENT,
    order_id INT(11) NOT NULL,
    ...,
    PRIMARY KEY (id),
    INDEX order_id (order_id)
);
While this works it is wasting space because it requires two indexes to be maintained. Compare it with the following:
CREATE TABLE order_item (
    order_id INT(11) UNSIGNED NOT NULL,
    item_no SMALLINT(5) NOT NULL AUTO_INCREMENT,
    ...,
    PRIMARY KEY (order_id, item_no)
);
There is only one index, the primary key (order_id + item_no). There is no need for a separate index on 'order_id' alone because it forms the leading part of the primary key, and those of you who know how databases work will know that the leading part of an existing index can be used without creating a separate index. Dropping the redundant integer column therefore saves 4 bytes per record without any loss in functionality or performance.
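The claim that the leading part of a composite index is usable on its own can be verified directly. Here is a sketch using Python's sqlite3 module (the DDL is simplified, as SQLite does not use MySQL's INT(11) UNSIGNED or per-group AUTO_INCREMENT syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_item (
        order_id INTEGER NOT NULL,
        item_no  INTEGER NOT NULL,
        PRIMARY KEY (order_id, item_no)
    )""")

# Query on order_id alone: the composite primary key index is still used,
# as the query plan shows a SEARCH on (order_id=?) rather than a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM order_item WHERE order_id = 42"
).fetchall()
print(plan[-1][-1])
```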
This type of relationship can only be satisfied by creating an additional table, sometimes called an 'intersection', 'link' or 'cross-reference' table, to act as the 'many' in two one-to-many relationships. Thus instead of A <-> B (many 'A' to many 'B') you have A -> X <- B (one 'A' to many 'X' and one 'B' to many 'X') where 'X' is the xref table.
It should be obvious that table 'X' requires two foreign keys, one linking up to table 'A' and another linking up to table 'B'. But what to use for the primary key? A common mistake is as follows:
CREATE TABLE xref (
    id INT(11) NOT NULL AUTO_INCREMENT,
    a_id INT(11) NOT NULL,
    b_id INT(11) NOT NULL,
    PRIMARY KEY (id),
    INDEX a_id (a_id),
    INDEX b_id (b_id)
);
While this works there are three indexes involved, but as the combination 'a_id' + 'b_id' has not been defined as a unique key it is possible for multiple rows to exist with the same values, which would plainly be wrong. A better setup would be as follows:
CREATE TABLE xref (
    a_id INT(11) UNSIGNED NOT NULL,
    b_id INT(11) UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),
    INDEX b_id (b_id)
);
Here the 'id' column has been dropped as there is already a viable primary key. There is no need to have a separate index for 'a_id' on its own as it is already covered by its leading position in the primary key. This cannot serve as an index for 'b_id', so that requires its own index.
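The point that the compound primary key prevents duplicate pairs can be sketched with Python's sqlite3 module (the a_id/b_id values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE xref (
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        PRIMARY KEY (a_id, b_id)
    )""")
conn.execute("INSERT INTO xref VALUES (1, 2)")
try:
    conn.execute("INSERT INTO xref VALUES (1, 2)")   # same pair again
    outcome = "duplicate accepted"
except sqlite3.IntegrityError:
    outcome = "duplicate rejected"
print(outcome)   # duplicate rejected
```

With the auto-increment 'id' design the second INSERT would have succeeded, silently creating a nonsense duplicate relationship.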
Here is a more complicated example:
CREATE TABLE `product_identifier` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `product_id` INT(11) NOT NULL,
    `identifier_type_id` INTEGER(11) NOT NULL,
    `identifier` VARCHAR(20) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `pi_identifier_type` (`identifier`, `identifier_type_id`),
    UNIQUE INDEX `pi_identifier_type_product` (`identifier`, `identifier_type_id`, `product_id`),
    INDEX `pi_identifier` (`identifier`),
    INDEX `pi_identifier_product` (`identifier`, `product_id`),
    INDEX `FI_product_id` (`product_id`),
    INDEX `FI_identifier_type_id` (`identifier_type_id`)
);
If the business rules which this table is supposed to satisfy were applied with a little more thought, then the following structure would serve just as well, but would require fewer indexes and less disk space:
CREATE TABLE `product_identifier` (
    `product_id` INT(11) UNSIGNED NOT NULL,
    `identifier_type_id` VARCHAR(4) NOT NULL,
    `identifier` VARCHAR(20) NOT NULL,
    PRIMARY KEY (`product_id`, `identifier_type_id`),
    UNIQUE INDEX `identifier_type_id` (`identifier_type_id`, `identifier`)
);
Notice that the compound primary key also serves as an index for `product_id` on its own, and the compound unique key also serves as an index for `identifier_type_id` on its own. Notice also that I have changed `identifier_type_id` from an integer to a string, as there is no point in having a separate table to translate a 4-byte integer into a 4-byte string.
Although not very common, one-to-one relationships do happen when an entity has optional data on another table, and this 'other' table is identified as a 'child' table but where only one row per parent is allowed. A common mistake is as follows:
CREATE TABLE extra_data (
    id INT(11) NOT NULL AUTO_INCREMENT,
    parent_id INT(11) NOT NULL,
    ...,
    PRIMARY KEY (id),
    INDEX parent_id (parent_id)
);
While this works it requires two indexes, but as the index on 'parent_id' has not been defined as unique it is possible for multiple rows for the same parent to be created, which would plainly be wrong. A better setup would be as follows:
CREATE TABLE extra_data (
    parent_id INT(11) UNSIGNED NOT NULL,
    ...,
    PRIMARY KEY (parent_id)
);
Here the redundant field 'id' has been dropped as the primary key field can also serve as the foreign key. This also satisfies the rule that only one row can exist for each value of 'parent_id'.
This "feature" has only appeared recently in some databases, so is not universally adopted. I can see no good reason to link to a column which is NOT the primary key and is therefore capable of being changed.
This means that it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.
If the primary key contains multiple columns then the foreign key should have the same number of columns. Without this one-to-one match it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.
This allows the key to be changed and the changed value to cascade down to all related records on child tables. This totally negates the purpose of a technical key, and if used unwisely could cause a large processing overhead.
Take a look at the following DDL statements:
CREATE TABLE IF NOT EXISTS `brand` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `brand` varchar(255) NOT NULL,
    PRIMARY KEY (`id`)
);
CREATE TABLE IF NOT EXISTS `product` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    ...
    `brand` int(11) DEFAULT NULL,
    ...
    PRIMARY KEY (`id`),
    CONSTRAINT `product_fk_1` FOREIGN KEY (`brand`) REFERENCES `brand` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
Do you see the mistake? There is a column called brand which exists on two tables, but on one it has the data type VARCHAR while on the other it is an INTEGER.
The sensible way to do this would be as follows:
CREATE TABLE IF NOT EXISTS `brand` (
    `brand_id` int(11) NOT NULL AUTO_INCREMENT,
    `brand_name` varchar(255) NOT NULL,
    PRIMARY KEY (`brand_id`)
);
CREATE TABLE IF NOT EXISTS `product` (
    `product_id` int(11) NOT NULL AUTO_INCREMENT,
    ...
    `brand_id` int(11) DEFAULT NULL,
    ...
    PRIMARY KEY (`product_id`),
    CONSTRAINT `product_fk_1` FOREIGN KEY (`brand_id`) REFERENCES `brand` (`brand_id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
Take a look at the following DDL statements:

CREATE TABLE IF NOT EXISTS `product_price` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `product_supplier_id` int(11) NOT NULL,
    `price` decimal(7,2) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `product_supplier_id` (`product_supplier_id`),
    CONSTRAINT `product_price_fk_1` FOREIGN KEY (`product_supplier_id`)
        REFERENCES `product_supplier` (`id`)
        ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE IF NOT EXISTS `product_price_history` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `prod_supp_id` int(11) NOT NULL,
    `price` decimal(7,2) NOT NULL,
    `start_date` datetime NOT NULL,
    PRIMARY KEY (`id`),
    KEY `idx_product_price_history` (`prod_supp_id`),
    CONSTRAINT `product_price_history_fk_1` FOREIGN KEY (`prod_supp_id`)
        REFERENCES `product_supplier` (`id`)
        ON DELETE CASCADE ON UPDATE NO ACTION
);
Do you see the mistake? There are two columns with the same content, but on one it is called product_supplier_id and on the other it is called prod_supp_id. These are wrong because they are different spellings of the same name, but in certain circumstances it may be permissible to use completely different names.
This may seem only a minor point, but as an unsigned integer has a maximum value which is twice that of a signed integer, you are effectively doubling the range of possible values without any increase in storage space. If the software has been written to only expect positive values then it would be good practice to prevent negative values being allowed in by the database.
MySQL does not have a native BOOLEAN data type, so it is necessary to use the nearest alternative. The recommended method is as follows:
`is_boolean` TINYINT(1) UNSIGNED NOT NULL DEFAULT '0',
Note that the column name begins with 'is_', which provides an enormous hint that the column is boolean as well as what the TRUE condition actually means. I once came across a situation where a table had a column called ACTIVE_OR_INACTIVE where it was unclear whether TRUE meant 'is active' or 'is inactive'. A name such as IS_ACTIVE does not cause such confusion.
Although this allows a range of values other than 0 or 1 which are the limits for a true BOOLEAN column, 0 is always FALSE while any non-zero value (which includes 1 of course) is treated as TRUE. MySQL will allow you to test for the values TRUE or FALSE as well as the values 1 or 0 in your SELECT statement, but under some circumstances a non-zero value other than 1 may not be included in a search for TRUE. Note the following slightly different statements:
WHERE is_boolean IS TRUE        // will include any non-zero value
WHERE is_boolean IS NOT TRUE    // will exclude any value apart from '0'
WHERE is_boolean = TRUE         // will include the value '1' only
WHERE is_boolean != TRUE        // will exclude any value which is not '1'
WHERE is_boolean IS FALSE       // will include the value '0' only
WHERE is_boolean IS NOT FALSE   // will exclude any value which is not '0'
WHERE is_boolean = FALSE        // will include the value '0' only
WHERE is_boolean != FALSE       // will exclude any value which is not '0'
WHERE is_boolean                // same as IS TRUE
WHERE !is_boolean               // same as IS FALSE
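The difference between IS TRUE and = TRUE shows up in any engine that stores booleans as integers. Here is a sketch using Python's sqlite3 module, which (in reasonably recent SQLite versions, like MySQL) accepts the TRUE keyword and treats any non-zero value as true under IS TRUE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (is_boolean INTEGER NOT NULL)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (1,), (2,)])

is_true = conn.execute(
    "SELECT COUNT(*) FROM t WHERE is_boolean IS TRUE").fetchone()[0]
eq_true = conn.execute(
    "SELECT COUNT(*) FROM t WHERE is_boolean = TRUE").fetchone()[0]
print(is_true, eq_true)   # 2 1 - the row holding 2 is truthy but is not "= TRUE"
```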
It is also possible to define a numerical BOOLEAN as follows:
`some_field` INT(11) DEFAULT NULL,
The use of signed integers with a size greater than one is a sure sign of a sloppy mind. The specification TINYINT(1) UNSIGNED makes it more obvious that the range of possible values is severely limited.
Other possibilities are as follows:
`is_boolean` CHAR(1) NOT NULL DEFAULT 'N' COMMENT 'N=No, Y=Yes',
`is_boolean` ENUM('No','Yes') DEFAULT 'No',
Note that with these non-numeric data types it is not possible to test for the values TRUE or FALSE - you must use the actual character strings. Note also that with ENUM columns the valid values are given index numbers which start at 1, which allows an invalid value to be given the index number of 0. Such invalid values will be excluded from any search using either of the following:
WHERE is_boolean = 'No'    // will test for index=1 only
WHERE is_boolean = 'Yes'   // will test for index=2 only
It seems that some novices do not understand how certain data types work before they try using them. Take the following which I encountered recently:
`type` enum('A','M') NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
As anyone who has read the manual will tell you, each entry in the list of permissible values is given an index number, and it is the index number which is stored in the database. The software can then retrieve the value associated with each index number from the ENUM list. It is not usual practice to provide a list of values which are in turn keys to yet another set of values.
The correct way to use an ENUM data type is as follows:
`type` enum('Admin','Member') NOT NULL DEFAULT 'Member',
An alternative method would be as follows:
`type` CHAR(1) NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
Trying to mix the two methods is a sure sign of a very confused mind.
This was identified by Brian Cryer in the comp.databases.mysql newsgroup, who used to work with someone who insisted on adding an index for every field that was used as part of a query, as in the following table:
CREATE TABLE Arrivals (
    ArrivalID int AUTO_INCREMENT,
    CountryID int,
    SiteID int,
    StationID int,
    PRIMARY KEY (ArrivalID),
    INDEX I1 (CountryID),
    INDEX I2 (SiteID),
    INDEX I3 (StationID)
);
Most of the queries against this table were of the form:
SELECT * FROM Arrivals WHERE CountryID=X AND SiteID=Y;
What he didn't appreciate is that the database will normally use only one index per table when processing a query, so defining multiple single-column indexes like this - especially ones which would never actually be used - is wasteful. Each index consumes disk space and processor cycles to maintain its contents. What he should have done was define a single compound index, such as:

INDEX I1 (CountryID, SiteID)
He went on to compound his error by simply adding compound indexes to his previous mistakes:
INDEX I1 (CountryID),
INDEX I2 (CountryID, SiteID)
The first of these is totally redundant as it is the leading part of the second index.
This was identified by Herman Viaene in the comp.databases.mysql newsgroup.
While it is permissible to have compound keys which are comprised of more than one column, it is not a good idea to have a column with compound values. By this I mean a value such as AAAA-BBBB-CCCC-NNNN where each of the different parts has a different meaning, such as product type, subtype, et cetera. This used to be quite common in the pre-relational database days when compound keys were not allowed, so as much data as necessary was packed into a single column. This would cause problems whenever one of the substrings became too short for the addition of new codes, and would also make searching on anything other than the leading substring a bit of a headache. With modern databases each of those components would be in its own column, so none of those old problems would exist.
In some applications there may be values which can be calculated from other values, so there is little point in storing all of them in the database. For example, in a sales order processing system you will have order items which have UNIT_PRICE, NUMBER_OF_UNITS and EXTENDED_PRICE. Only two of these values need be stored in the database as the third can be obtained from a simple calculation, such as:
EXTENDED_PRICE = UNIT_PRICE * NUMBER_OF_UNITS
UNIT_PRICE = EXTENDED_PRICE / NUMBER_OF_UNITS
So which values do you store in the database, and which do you calculate as and when necessary? When there is a choice between two possible calculations, one involving a multiplication and another involving a division, you should always choose the multiplication. Why? The simple answer is "accuracy". If the values contain different numbers of decimal places then the result of a multiplication, even if it is rounded afterwards, will always be more accurate than dividing by a value which may already have been rounded. If the UNIT_PRICE and EXTENDED_PRICE both have two decimal places then it might be difficult to prove this point, but suppose the UNIT_PRICE has three decimal places? Look at the following example where the UNIT_PRICE is 4.005 and the NUMBER_OF_UNITS is 7:

4.005 * 7 = 28.035

which can be stored as either 28.04 (rounded up) or 28.03 (rounded down).
What happens when you store the EXTENDED_PRICE and try to calculate the UNIT_PRICE?

28.04 / 7 = 4.005714285714286 (which is rounded UP to 4.006)
28.03 / 7 = 4.004285714285714 (which is rounded DOWN to 4.004)

Neither calculation recovers the original value of 4.005.
This becomes even more critical if you ever do currency conversions using an exchange rate which has 5 or more decimal places. The two possible calculations are:
HOME_CURRENCY_AMOUNT * EXCHANGE_RATE = FOREIGN_CURRENCY_AMOUNT
FOREIGN_CURRENCY_AMOUNT / EXCHANGE_RATE = HOME_CURRENCY_AMOUNT
The difference between the two calculations may appear small to you, but in the world of finance where you are expected to balance your books to the nearest penny on sums which go into millions, such minor differences are not acceptable.
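The arithmetic above can be checked with Python's decimal module (a sketch; the 4.005 price and quantity of 7 are taken from the earlier example):

```python
from decimal import Decimal, ROUND_HALF_UP

unit_price = Decimal("4.005")   # three decimal places
qty = 7

# Store the multiplication result, rounded to two decimal places.
extended = (unit_price * qty).quantize(Decimal("0.01"), ROUND_HALF_UP)
print(extended)    # 28.04

# Trying to recover the unit price by division no longer works:
# the rounding applied to the stored value has destroyed precision.
recovered = (extended / qty).quantize(Decimal("0.001"), ROUND_HALF_UP)
print(recovered)   # 4.006 - not the original 4.005
```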
This is the opposite of the previous tip. A junior designer who heard about the idea of not storing values which could be calculated went further than necessary and decided that in his accounting system he would not bother to store the account balance anywhere; he would simply calculate it by summing the values on the associated records in the transaction table, which stored individual debit and credit amounts. While the number of transactions was small this overhead was negligible, but as the number of accounts and transactions grew, the time taken to perform this calculation became longer and longer, eventually becoming totally unacceptable.
Where a value can be calculated easily by reading a single record then the cost of that calculation is acceptable, but where the number of records you need to read can grow and grow over time then it really should be stored for speed of access, and should only be recalculated as and when necessary.
One thing worse than producing a less-than-efficient design before a single line of code has been written is to then change a table's structure in such a way that it breaks that code which has been written. A typical case is when a new column is added to a table, but instead of being made optional (nullable) it is made required (not nullable) but without a default value. Any existing query which attempts to insert a record into that table will therefore fail. This is even worse in those situations where an application contains hard-coded queries in multiple places as each of those places will have to be identified and changed.
This is because they are different things: a FOREIGN KEY is used in a SELECT query, whereas a FOREIGN KEY CONSTRAINT is used in an INSERT, UPDATE or DELETE query.
It is possible to use a column in a child table as a foreign key which references the primary key in a parent table without having to create a foreign key constraint for that column in the child table. This would appear in a SELECT query as a JOIN, such as in the following example:
SELECT A.id, A.name, B.attribute FROM A LEFT JOIN B ON (B.primary_key = A.foreign_key) WHERE ...
In this example table 'A' contains a column called 'foreign_key' which contains the value for the primary key (in a column called 'primary_key') in table 'B'. It is not necessary for column 'foreign_key' on table 'A' to be defined as a foreign key before it can be used in this way. It is not necessary for the foreign and primary keys to have the same names, but they must have the same datatypes and sizes. It is also possible for both keys to be compound or composite keys, but they must have the same number of columns, and each column in one key must match the datatype and size of the corresponding column in the other key.
Note also that in order to use a foreign key in a SELECT query you must do so explicitly by adding a JOIN clause which specifies both the identity of the joined table and the column names in the foreign key which link to the corresponding column names in the primary key. Even if this information is available in a foreign key constraint, there is no mechanism in the database server to automatically incorporate it into a SELECT query.
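This point can be verified directly (SQLite, hypothetical tables): no FOREIGN KEY CONSTRAINT is declared anywhere, yet the JOIN works purely on matching values, exactly as written in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (primary_key INTEGER PRIMARY KEY, attribute TEXT);
    -- note: no REFERENCES clause on A.foreign_key
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT, foreign_key INTEGER);
    INSERT INTO B VALUES (10, 'widget');
    INSERT INTO A VALUES (1, 'first', 10), (2, 'orphan', 99);
""")

rows = conn.execute("""
    SELECT A.id, A.name, B.attribute
    FROM A LEFT JOIN B ON (B.primary_key = A.foreign_key)
    ORDER BY A.id
""").fetchall()

# The LEFT JOIN returns every row of A; a value with no matching
# parent simply yields NULL, because nothing enforces the relationship.
assert rows == [(1, 'first', 'widget'), (2, 'orphan', None)]
```

The second row shows the flip side: without a constraint nothing stops an orphaned value from being stored in the first place.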
On the other hand, if you have defined a foreign key constraint then the database server will enforce it automatically during every INSERT, UPDATE or DELETE: it will reject any change which would leave a child row pointing at a non-existent parent, and it will carry out whatever referential action (such as CASCADE, SET NULL or RESTRICT) was declared in the constraint.
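A sketch of that automatic behaviour (SQLite, hypothetical tables): with ON DELETE CASCADE declared in the constraint, deleting a parent removes its children without the developer writing any extra query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite requires this per connection
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id) ON DELETE CASCADE
    );
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (100, 1), (101, 1);
""")

# One DELETE on the parent; the server cascades to the children itself.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
assert remaining == 0
```

The same constraint would also reject an attempt to insert a child row whose parent_id does not exist in the parent table.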
Some people have very dubious excuses for NOT using compound keys. Here are some comments from Composite Primary Keys:
there's only one reason i dislike composite keys: you have to type more than one column in the joins.
I tend to side with Andre on this topic. One thing that I will add when using an identity field as the PK is the reduction in columns in child tables. Especially when you have tables that are 5 or 6 levels down, and the parent tables have composite keys, and the parents parents tables have composite keys, etc. Eliminates a decent amount of columns, IMO is easier to read/understand the schema, and less SQL to write in the Joins.
Surrogates are definitely my favourite form of primary key. I would actually argue your definition of a primary key but I see the point that you are trying to get across. I loath composite keys being used loosely in a database because I don't like writing sql that involves remembering to join on multiple columns (who does).
The idea that "saving keystrokes" is more important than readability or good database design simply does not cut the mustard with me. The decision as to what type of key to use, whether it be natural, technical or compound, is not something that can be defined in a simple rule as it is something that requires experience and judgment. While any of the possibilities may actually work, some will offer better performance, or use less space on disk, or be easier to code around. You can use whatever floats your boat, but don't complain if you make the wrong choice and your boat sinks.
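For what those "extra keystrokes" actually amount to, here is a sketch (SQLite, hypothetical order/order-line tables) of a compound primary key and a join on it: the whole cost is naming both columns in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        customer_id INTEGER NOT NULL,
        order_no    INTEGER NOT NULL,
        order_date  TEXT,
        PRIMARY KEY (customer_id, order_no)       -- compound key
    );
    CREATE TABLE order_line (
        customer_id INTEGER NOT NULL,
        order_no    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        product     TEXT,
        PRIMARY KEY (customer_id, order_no, line_no)
    );
    INSERT INTO orders VALUES (1, 1, '2021-03-13');
    INSERT INTO order_line VALUES (1, 1, 1, 'widget'), (1, 1, 2, 'gadget');
""")

# The join simply names every column in the compound key.
rows = conn.execute("""
    SELECT o.order_date, l.line_no, l.product
    FROM orders o
    JOIN order_line l
      ON (l.customer_id = o.customer_id AND l.order_no = o.order_no)
    ORDER BY l.line_no
""").fetchall()
assert len(rows) == 2
```

As a side benefit, the compound key itself guarantees that a line cannot exist twice within the same order, which a standalone ID column does not.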
I found the following comments in Composite Primary Keys:
If you're coding in any sort of language above the sql layer, then having the same kind of ID (autoincremented integer) with the same name (ID) on all tables makes it possible and easy to abstract a lot of database functionality and reduces code complexity immensely. This, IMHO, is far more valuable than "being prepared for" the eventualities you mention in the article.
Ramon Leon 2007-08-24
You're argument doesn't work because you're ignoring that programs consist of much more than just schema. You approach works great if you enjoy hand writing all your sql, it's lousy when you want a middleware layer to automate CRUD. People who say the single incrementing key is simpler aren't referring to simpler schemas, they're referring to the vastly simpler programs that result from fully automating the mapping of objects into tables.
These people obviously do not have enough experience in writing code to deal effectively and efficiently with database queries. I use a framework which was specifically designed to help in the building of database applications, and it has the ability to automatically generate queries with JOINs. This is how it works:
Every table in the database (and in my ERP application there are over 400) has its own class which is automatically populated with metadata exported from my Data Dictionary, including details of its columns, its primary and unique keys, and its relationships with parent and child tables.
This arrangement allows me to support in my software anything which is allowed by the database, so restrictions such as "the primary key must consist of a single column which must be called ID and which must be an autoincremented integer" simply do not exist. If your middleware has such restrictions then I would throw it away and start again as it was obviously written by someone of dubious ability and has exceeded its sell-by date.
Unlike database servers which normally ignore foreign keys when constructing SELECT queries, the availability of this metadata in my software allows me to produce more advanced results. If a particular entry in the $parent_tables array does not have a value for parent_field then that entry is ignored, otherwise the query will be constructed along the following lines:
SELECT child.*, parent.parent_field_1[, parent.parent_field_2] FROM child LEFT JOIN parent ON (parent.fldparent1=child.fldchild1 AND parent.fldparent2=child.fldchild2) WHERE ...
It's not exactly rocket science, so if I can do it then surely anybody can.
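The mechanism described above can be sketched in a few lines. Note that the metadata layout used here is an assumption modelled on the description in the text, not the framework's actual structure; only entries with a parent_field are included, as stated.

```python
def build_select(child, parent_tables):
    """Generate a SELECT with a LEFT JOIN for each relevant parent table."""
    select_cols = [f"{child}.*"]
    joins = []
    for parent, meta in parent_tables.items():
        if not meta.get("parent_field"):
            continue   # entries without a parent_field are ignored
        select_cols += [f"{parent}.{f}" for f in meta["parent_field"]]
        on = " AND ".join(f"{parent}.{p}={child}.{c}"
                          for p, c in meta["columns"])
        joins.append(f"LEFT JOIN {parent} ON ({on})")
    return f"SELECT {', '.join(select_cols)} FROM {child} " + " ".join(joins)

# Hypothetical metadata for a child table with one relevant parent,
# using the column names from the example query above:
parent_tables = {
    "parent": {
        "parent_field": ["parent_field_1", "parent_field_2"],
        "columns": [("fldparent1", "fldchild1"), ("fldparent2", "fldchild2")],
    }
}
sql = build_select("child", parent_tables)
assert sql == ("SELECT child.*, parent.parent_field_1, parent.parent_field_2 "
               "FROM child LEFT JOIN parent "
               "ON (parent.fldparent1=child.fldchild1 AND parent.fldparent2=child.fldchild2)")
```

Because the generator iterates over whatever column pairs the metadata supplies, a compound key costs it nothing extra: one pair or five, the loop is the same.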
Every decision has consequences, some intended, some unintended. Every design decision has consequences for those who have to implement that design. I have little respect for designers who cannot build what they design as they have a tendency to design things which are either difficult or impossible to build. While none of the mistakes I have identified above makes it impossible to write code which uses the database, each one acts as a speed bump which slows the developer down. As any driver will tell you, too many speed bumps will have a drastic effect on your journey time. While the distance remains the same the time to reach the destination has been increased, so when a development schedule fails to be met because of too many designer-instigated speed bumps, why do the creators of those speed bumps have the audacity to blame the developers?
If I have little respect for designers who make such basic mistakes, I have absolutely zero respect for those designers who, after being told of the consequences, often unintended, of their poor design, have the arrogance to say "My design is perfect. You'll have to code around it!" In my humble opinion such arrogant nincompoops should be forced to spend time working as lavatory attendants so they get to experience what it is like dealing with other people's cr*p instead of forcing other people to deal with theirs. I once worked on a new application where the UI, application and database designers had absolutely no knowledge of how the development language worked, which meant that nothing they designed could be implemented easily. Their combined designs were so bad that it took 10 days to build 2 simple components which I can now implement, using my own framework, in under 10 minutes.
"How can you possibly reduce development times from 10 days to 10 minutes?" I hear you say. The simple answer is that I am still a hands-on developer and not a day-dreaming designer or an architecture astronaut. I design what can be built because I regularly build from my designs. When I spot a speed bump I do what I can to flatten it and reduce its effect. It doesn't matter what part of the project causes the problem - it could be the naming standards, the development standards, the application design, the database design or the UI design - anything which slows down the developer will be put under the microscope, dissected and reassembled without the problem.
This is a technique which I have used on all my projects, and I have regularly made little savings here and little savings there. Sometimes what used to be a laborious piece of coding can be converted into a simple subroutine. Sometimes whole swathes of code can be replaced by several subroutines. Sometimes a collection of subroutines can be merged into a full-blown framework. I don't have to write any SQL statements any more because my framework does it for me, and that is only possible because I kept tweaking my design until it was. My framework can do lots of things automatically for exactly the same reason.
Lots of little savings soon accumulate into big savings.
Good design pays off, bad design costs both in initial development and ongoing maintenance.
13 Mar 2021 - Added "How can you have a Foreign Key without a Foreign Key Constraint?", "Compound keys require too many keystrokes" and "Compound keys make it difficult for SQL generators".
09 Nov 2012 - Updated list of common mistakes to contain "Too many indexes" and "Columns containing compound values" following feedback from the comp.databases.mysql newsgroup.