In my long career in IT I have designed and built many applications using different languages, different paradigms and different storage systems. Among these storage systems have been flat files, indexed files, hierarchical databases, network databases and relational databases. The vast majority of my work over the past 20 years has been with relational databases, which are accessed using the ubiquitous SQL language. Although the technology is constantly evolving, with new features and new ways of doing things, some concepts have stood the test of time, and attempts by newcomers to supplant them with new and "cool" ideas often have unintended consequences further down the line. Whenever I see a database designed by someone who has yet to realise that these "cool" ideas are not as cool as he would like to think, I give a deep sigh and say "Will these novices never learn?" (I tell a lie. What I actually think is "WTF! What idiot designed this pile of cr*p?", but I'm too polite to say so.) I am therefore writing this article in the hope that my experiences will provide some guidance to those who are new to the world of database design.
There are some people out there who can design but never build. I have come across quite a few who cannot build what they design, and sometimes they design things which cannot actually be built without immense effort, so excuse me if I don't place any value on their opinions. It is only by building what you design that you actually experience the consequences of every design decision. It is only by using databases, by writing the queries to read and store data, that you get to understand the pros and cons of the different ideas that are floating around. I have written code which accesses databases designed the "old fashioned" way, and I have also written code which accesses databases designed with some of the "new fangled" ideas. This means that I have personally witnessed the consequences of each change from the "old" way to the "new" way, which in turn means that I am speaking from direct practical experience and not quoting some rumour, gossip, hearsay or old wives' tale.
Let me start by providing definitions of the terms used in this article's title:
The big problem here is when a novice receives advice he innocently assumes that the speaker actually knows what he is talking about. Unfortunately the world is full of charlatans, cowboys and purveyors of snake oil, so the poor novice is unlikely to spot when he is being directed up the wrong path. Searching the internet is just as bad as there is nothing to stop these eejits from spreading their false wisdom. If you wish to be directed down the path of enlightenment then read on.
These are the basic database rules that have existed for decades. Some of these rules may have been relaxed in recent DBMS software, but in my humble opinion some of these changes are not for the better and should be kicked into the long grass.
Secondary to the basic rules are some standard practices which have stood the test of time:
It was pointed out in the comp.databases.mysql newsgroup that SELECT * FROM customers (plural) would seem to be more logical than SELECT * FROM customer, but when you reference a column within a table the opposite is true: customer.customer_id would seem to be more logical than customers.customer_id. It is not possible to please everybody all of the time, so I tend to stick with the old convention which used the terms "customer file" and "product file" instead of "customers file" and "products file".
Some useful conventions I have followed for years are as follows:
While such conventions as these are not absolutely necessary, they do make it easier for a programmer to look at a column name and have a better understanding of what data it contains.
SELECT tableA.id, tableA.name, tableB.comment
  FROM tableA
  LEFT JOIN tableB ON (tableB.id = tableA.id)

can be reduced to:

SELECT tableA.id, tableA.name, tableB.comment
  FROM tableA
  LEFT JOIN tableB USING (id)

As you can see, this only works if the two ID columns share the same value as well as the same name.
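The equivalence of the two spellings is easy to check for yourself. Here is a minimal sketch using Python's built-in sqlite3 module (the tableA/tableB names and their contents are hypothetical):

```python
import sqlite3

# Two hypothetical tables sharing an 'id' column with the same values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, comment TEXT);
    INSERT INTO tableA VALUES (1, 'first');
    INSERT INTO tableB VALUES (1, 'a note');
""")

on_form = conn.execute(
    "SELECT tableA.id, tableA.name, tableB.comment "
    "FROM tableA LEFT JOIN tableB ON (tableB.id = tableA.id)").fetchall()
using_form = conn.execute(
    "SELECT id, name, comment "
    "FROM tableA LEFT JOIN tableB USING (id)").fetchall()
print(on_form == using_form)   # True - the two spellings are equivalent
```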
Naming Exceptions: As with almost every rule there may be exceptions in certain circumstances. The same content can have different names:
Here are some changes or "improvements" which have been introduced into some modern databases, but which should not necessarily be adopted as a "good idea":
As well as identifying some rules which do exist, here are some rules which do not exist in modern DBMS software:
Some novices quote the observation "numeric keys are faster than string keys" in order to justify their use of single integer keys where a viable string key or compound key already exists. While this observation may have had some merit decades ago when the hardware was considerably slower and the software less sophisticated, it has much less merit today. Quoting an out of date observation which is not substantiated with sufficient proof in the form of proper benchmarks could lead to wasted effort. If the effort of introducing an artificial integer key is greater than the speed improvement gained by using an integer key instead of a string key, then the net improvement is actually negative and so is not an improvement after all. This is especially true when you have to perform a lookup on a natural string key in order to obtain the integer primary key.
Here are some of the mistakes which I regularly encounter in databases which were designed by those I consider to be nincompoops:
Before moving from the draft stage of your database design to the final physical implementation it is absolutely essential that you go through a process called 'data normalisation'. This ensures that you have the right data in the right place and makes the insertion and subsequent retrieval of data as efficient and effective as possible. Without it your application could run like a pig with a wooden leg. This process is explained in more detail in The Relational Data Model, Normalisation and effective Database Design.
Having a column called 'id' on every table breaks the "different content should have different names" rule. If you ever bother to look at the sample SQL queries in the database vendor's documentation you will always see column names such as 'product_id' and 'customer_id' and never just 'id'. Why is this, I wonder? I once asked a novice why he followed this dubious practice, and his answer was: "It avoids duplication when typing queries. Why use 'product.product_id' when you can type less with 'product.id'?" A novice may think that this is a good idea, but a ninja knows that this "saving" comes with several gotchas:
Gotcha #1: Multi-table JOINS
What this novice failed to realise was that when the database returns a result set none of the column names is qualified with a table name. This may not matter too much with a single-table query, but what happens with a multi-table JOIN? Take a look at the following query:
SELECT *
  FROM order AS o
  LEFT JOIN deal AS d ON (o.deal_id = d.id)
  LEFT JOIN user AS u ON (d.user_id = u.id)
  LEFT JOIN address_book AS ab ON (o.address_id = ab.id)
  LEFT JOIN product_supplier AS ps ON (d.prod_supplier_id = ps.id)
  LEFT JOIN product_price AS pp ON (pp.product_supplier_id = ps.id)
  LEFT JOIN product AS p ON (ps.product_id = p.id)
  LEFT JOIN supplier AS s ON (ps.supplier_id = s.id)
 WHERE o.status = 'A'
This query references eight tables, which means that the result will contain eight references to a column called 'id'. This is bad enough when you look at the result in your SQL client program, as you will find it impossible to relate each occurrence of 'id' back to its table of origin. It is worse when the result is made available inside your program: all column names in the result set must be unique, so there will only be one value for each column name. Each successive value for the 'id' column will overwrite the previous value, which means that the value for 'id' in the result set will be the one from the last table in the query. So what happens when you want the 'id' value for a specific table which may or may not be the last in the query? The only solution is to specify an alias in the select list, such as:
SELECT *, o.id AS order_id, d.id AS deal_id, u.id AS user_id, ab.id AS address_id, p.id AS product_id, ...
The observant among you will see that the extra effort required to get around the problem caused by using the universal name of 'id' is greater than the savings which were assumed in the first place. The saving of *NOT* including the table name in the primary key name is more than eaten up when you have to modify your queries to include alias names which *DO* include the table name. So what you gain on the swings you lose twice over on the roundabouts. Although a human may be able to cope with the extra effort required to get around this "saving", with a proper framework the number of hand coded queries should be small to non-existent, but a computerised query generator might find it more difficult to deal with such violations of proper naming conventions.
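This overwriting behaviour is not peculiar to MySQL. A minimal sketch using Python's sqlite3 module (the orders/deal tables and their values are hypothetical) shows what happens when a client converts a joined row into a name-keyed structure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, deal_id INTEGER);
    CREATE TABLE deal   (id INTEGER PRIMARY KEY, descr TEXT);
    INSERT INTO orders VALUES (100, 7);
    INSERT INTO deal   VALUES (7, 'bulk discount');
""")
cur = conn.execute(
    "SELECT * FROM orders LEFT JOIN deal ON (deal.id = orders.deal_id)")
names = [d[0] for d in cur.description]   # ['id', 'deal_id', 'id', 'descr']
row = dict(zip(names, cur.fetchone()))    # duplicate keys: the last one wins
print(row["id"])   # 7 - the deal's id has silently overwritten the order's id (100)
```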
If you think the previous idea is clever, there are those who double the cleverness by using a universal 'name' column to go with the 'id' column. This means that the SELECT string has to be enhanced even more to include the following:
u.name AS user_name, p.name AS product_name, ...
So as well as having to specify aliases for every 'id' column you now have to double the effort by including aliases for every 'name' column. Double the savings! Double the fun!
Gotcha #2: Passing context to child programs
As well as having to modify queries to deal with multiple columns which have the same name, a ninja will spot a serious problem when trying to use a framework which passes context (selection criteria) from one program to another. Suppose the database has two tables called CUSTOMER and ORDER where the ORDER table has a foreign key which links back to an entry on the CUSTOMER table. The application will therefore have two programs called "List Customer" and "List Order". Without any selection criteria the "List Order" program will retrieve all orders in the system irrespective of the customer. Now suppose in the "List Customer" program there is the ability to select a customer then press a button to go to the "List Order" program to show only those orders which are associated with that customer. The "List Customer" program has to pass the relevant selection criteria to the "List Order" program so that it can retrieve the desired records. Because these two tables are related the primary key on the CUSTOMER table is also a foreign key on the ORDER table, so the context (selection criteria) which is passed from the "List Customer" program to the "List Order" program is the primary key of the selected customer record. This context can be passed as a simple string in the format
This context can be passed as a simple string in the format <pkey_name>='<pkey_value>' such as customer_id='12345'. Provided that the column name(s) in the primary key are *exactly* the same as the column name(s) in the foreign key, the receiving program can use the passed context without the need for any additional processing.
This entire process becomes more complicated if the column names are not *exactly* the same, such as when every primary key column of every table has the same name. The 'id' column in the CUSTOMER table cannot be used as the 'id' column in the ORDER table, as one contains the customer identity while the other contains the order identity. The child program which receives context from a parent program now needs additional code to translate the column name(s), so in this example 'id' has to be translated into 'customer_id'. This translation usually means extra work for the programmer, as it could not be automated unless the receiving program knew how each passed column name maps onto one of its own columns. This information is not usually available, so the process cannot easily be automated. It is usually the case that the effort required to implement the manual solution is much greater than the "savings" that the original design decision was supposed to generate, so that decision does not look so rosy in the cold light of day, now does it?
The novice designers like to justify their use of the universal column names 'id' and 'name' by saying that it makes it possible to use standard method names such as getById($id) and getByName($name) in their code. A ninja will laugh at this argument as he knows that in large systems the ability to retrieve records where the selection criteria is limited to either id='x' or name='x' is so restrictive that it will soon cause the system to come to a grinding halt. In the real world the variety of selection criteria is limitless, and as it would be impossible to generate a different method to deal with each possibility, a ninja uses a single getData($where) method in which the $where argument is a string which is acceptable as the WHERE clause in an SQL SELECT statement. This single method will take *ANY* valid string value and can therefore deal with *ANY* selection criteria.
If there is already a suitable primary key then you will be wasting both time and disk space by creating an additional technical key. This usually follows on from using the universal 'id' column. Take a look at the following:
CREATE TABLE country (
    id INT(11) NOT NULL AUTO_INCREMENT,
    iso_country_code CHAR(3) NOT NULL,
    country_name VARCHAR(50) NOT NULL,
    PRIMARY KEY (id),
    INDEX iso_country_code (iso_country_code)
);
The column 'iso_country_code' is guaranteed to always be unique, so that should be the primary key. The addition of a separate 'id' column just wastes space in the data table and in its own index. Why force every lookup through this table merely to convert a 4-byte integer into a 3-byte string? What is even worse is that the 'iso_country_code' column is defined as a non-unique index instead of a unique index (candidate key), which means that it is possible for the same country code to appear more than once.
Similar mistakes can be made with relationships between tables:
Take for example a table called ORDER and another called ORDER_ITEM where there can be many rows of ORDER_ITEM for each row of ORDER. Assuming that the primary key of the ORDER table is called 'order_id' it is possible to define the ORDER_ITEM table as follows:
CREATE TABLE order_item (
    id INT(11) NOT NULL AUTO_INCREMENT,
    order_id INT(11) NOT NULL,
    ...,
    PRIMARY KEY (id),
    INDEX order_id (order_id)
);
While this works it is wasting space because it requires two indexes to be maintained. Compare it with the following:
CREATE TABLE order_item (
    order_id INT(11) UNSIGNED NOT NULL,
    item_no SMALLINT(5) NOT NULL AUTO_INCREMENT,
    ...,
    PRIMARY KEY (order_id, item_no)
);
There is only one index, the primary key (order_id + item_no). There is no need for a separate index on 'order_id' alone because it forms the leading part of the primary key, and those of you who know how databases work will know that the leading part of an existing index can be used without creating a separate index. Dropping the redundant integer column therefore saves 4 bytes per record without any loss in functionality or performance.
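The claim that the leading part of a composite index is usable on its own can be verified directly. Here is a sketch using Python's sqlite3 module (the DDL is simplified, as SQLite does not use MySQL's INT(11) UNSIGNED or per-group AUTO_INCREMENT syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_item (
        order_id INTEGER NOT NULL,
        item_no  INTEGER NOT NULL,
        PRIMARY KEY (order_id, item_no)
    )""")

# Query on order_id alone: the composite primary key index is still used,
# as the query plan shows a SEARCH on (order_id=?) rather than a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM order_item WHERE order_id = 42"
).fetchall()
print(plan[-1][-1])
```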
This type of relationship can only be satisfied by creating an additional table, sometimes called an 'intersection', 'link' or 'cross-reference' table, to act as the 'many' in two one-to-many relationships. Thus instead of A <-> B (many 'A' to many 'B') you have A -> X <- B (one 'A' to many 'X' and one 'B' to many 'X') where 'X' is the xref table.
It should be obvious that table 'X' requires two foreign keys, one linking up to table 'A' and another linking up to table 'B'. But what to use for the primary key? A common mistake is as follows:
CREATE TABLE xref (
    id INT(11) NOT NULL AUTO_INCREMENT,
    a_id INT(11) NOT NULL,
    b_id INT(11) NOT NULL,
    PRIMARY KEY (id),
    INDEX a_id (a_id),
    INDEX b_id (b_id)
);
While this works there are three indexes involved, but as the combination 'a_id' + 'b_id' has not been defined as a unique key it is possible for multiple rows to exist with the same values, which would plainly be wrong. A better setup would be as follows:
CREATE TABLE xref (
    a_id INT(11) UNSIGNED NOT NULL,
    b_id INT(11) UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),
    INDEX b_id (b_id)
);
Here the 'id' column has been dropped as there is already a viable primary key. There is no need to have a separate index for 'a_id' on its own as it is already covered by its leading position in the primary key. This cannot serve as an index for 'b_id', so that requires its own index.
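The point that the compound primary key prevents duplicate pairs can be sketched with Python's sqlite3 module (the a_id/b_id values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE xref (
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        PRIMARY KEY (a_id, b_id)
    )""")
conn.execute("INSERT INTO xref VALUES (1, 2)")
try:
    conn.execute("INSERT INTO xref VALUES (1, 2)")   # same pair again
    outcome = "duplicate accepted"
except sqlite3.IntegrityError:
    outcome = "duplicate rejected"
print(outcome)   # duplicate rejected
```

With the auto-increment 'id' design the second INSERT would have succeeded, silently creating a nonsense duplicate relationship.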
Here is a more complicated example:
CREATE TABLE `product_identifier` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `product_id` INT(11) NOT NULL,
    `identifier_type_id` INTEGER(11) NOT NULL,
    `identifier` VARCHAR(20) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `pi_identifier_type` (`identifier`, `identifier_type_id`),
    UNIQUE INDEX `pi_identifier_type_product` (`identifier`, `identifier_type_id`, `product_id`),
    INDEX `pi_identifier` (`identifier`),
    INDEX `pi_identifier_product` (`identifier`, `product_id`),
    INDEX `FI_product_id` (`product_id`),
    INDEX `FI_identifier_type_id` (`identifier_type_id`)
);
If the business rules which this table is supposed to satisfy were applied with a little more thought, then the following structure would serve just as well, but would require fewer indexes and less disk space:
CREATE TABLE `product_identifier` (
    `product_id` INT(11) UNSIGNED NOT NULL,
    `identifier_type_id` VARCHAR(4) NOT NULL,
    `identifier` VARCHAR(20) NOT NULL,
    PRIMARY KEY (`product_id`, `identifier_type_id`),
    UNIQUE INDEX `identifier_type_id` (`identifier_type_id`, `identifier`)
);
Notice that the compound primary key also serves as an index for `product_id` on its own, and the compound unique key also serves as an index for `identifier_type_id` on its own. Notice also that I have changed `identifier_type_id` from an integer to a string, as there is no point in having a separate table to translate a 4-byte integer into a 4-byte string.
Although not very common, one-to-one relationships do happen when an entity has optional data on another table, and this 'other' table is identified as a 'child' table but where only one row per parent is allowed. A common mistake is as follows:
CREATE TABLE extra_data (
    id INT(11) NOT NULL AUTO_INCREMENT,
    parent_id INT(11) NOT NULL,
    ...,
    PRIMARY KEY (id),
    INDEX parent_id (parent_id)
);
While this works it requires two indexes, but as the index on 'parent_id' has not been defined as unique it is possible for multiple rows for the same parent to be created, which would plainly be wrong. A better setup would be as follows:
CREATE TABLE extra_data (
    parent_id INT(11) UNSIGNED NOT NULL,
    ...,
    PRIMARY KEY (parent_id)
);
Here the redundant field 'id' has been dropped as the primary key field can also serve as the foreign key. This also satisfies the rule that only one row can exist for each value of 'parent_id'.
This "feature" has only appeared recently in some databases, so is not universally adopted. I can see no good reason to link to a column which is NOT the primary key and is therefore capable of being changed.
This means that it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.
If the primary key contains multiple columns then the foreign key should have the same number of columns. Without this one-to-one match it is no longer possible to guarantee that the foreign key will retrieve a single row from the parent table. This defeats the original purpose of foreign keys.
This allows the key to be changed and the changed value to cascade down to all related records on child tables. This totally negates the purpose of a technical key, and if used unwisely could cause a large processing overhead.
Take a look at the following DDL statements:
CREATE TABLE IF NOT EXISTS `brand` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `brand` varchar(255) NOT NULL,
    PRIMARY KEY (`id`)
);
CREATE TABLE IF NOT EXISTS `product` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    ...
    `brand` int(11) DEFAULT NULL,
    ...
    PRIMARY KEY (`id`),
    CONSTRAINT `product_fk_1` FOREIGN KEY (`brand`) REFERENCES `brand` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
Do you see the mistake? There is a column called brand which exists on two tables, but on one it has the data type VARCHAR while on the other it is an INTEGER.
The sensible way to do this would be as follows:
CREATE TABLE IF NOT EXISTS `brand` (
    `brand_id` int(11) NOT NULL AUTO_INCREMENT,
    `brand_name` varchar(255) NOT NULL,
    PRIMARY KEY (`brand_id`)
);
CREATE TABLE IF NOT EXISTS `product` (
    `product_id` int(11) NOT NULL AUTO_INCREMENT,
    ...
    `brand_id` int(11) DEFAULT NULL,
    ...
    PRIMARY KEY (`product_id`),
    CONSTRAINT `product_fk_1` FOREIGN KEY (`brand_id`) REFERENCES `brand` (`brand_id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
);
Take a look at the following DDL statements:

CREATE TABLE IF NOT EXISTS `product_price` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `product_supplier_id` int(11) NOT NULL,
    `price` decimal(7,2) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `product_supplier_id` (`product_supplier_id`),
    CONSTRAINT `product_price_fk_1` FOREIGN KEY (`product_supplier_id`)
        REFERENCES `product_supplier` (`id`)
        ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE TABLE IF NOT EXISTS `product_price_history` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `prod_supp_id` int(11) NOT NULL,
    `price` decimal(7,2) NOT NULL,
    `start_date` datetime NOT NULL,
    PRIMARY KEY (`id`),
    KEY `idx_product_price_history` (`prod_supp_id`),
    CONSTRAINT `product_price_history_fk_1` FOREIGN KEY (`prod_supp_id`)
        REFERENCES `product_supplier` (`id`)
        ON DELETE CASCADE ON UPDATE NO ACTION
);
Do you see the mistake? There are two columns with the same content, but on one it is called product_supplier_id and on the other it is called prod_supp_id. These are wrong because they are different spellings of the same name, but in certain circumstances it may be permissible to use completely different names.
This may seem only a minor point, but as an unsigned integer has a maximum value which is twice that of a signed integer, you are effectively doubling the range of possible values without any increase in storage space. If the software has been written to only expect positive values then it would be good practice to prevent negative values being allowed in by the database.
MySQL does not have a native BOOLEAN data type, so it is necessary to use the nearest alternative. The recommended method is as follows:
`is_boolean` TINYINT(1) UNSIGNED NOT NULL DEFAULT '0',
Note that the column name begins with 'is_', which provides an enormous hint that the column is boolean as well as what the TRUE condition actually means. I once came across a situation where a table had a column called ACTIVE_OR_INACTIVE where it was unclear whether TRUE meant 'is active' or 'is inactive'. A name such as IS_ACTIVE does not cause such confusion.
Although this allows a range of values other than 0 or 1 which are the limits for a true BOOLEAN column, 0 is always FALSE while any non-zero value (which includes 1 of course) is treated as TRUE. MySQL will allow you to test for the values TRUE or FALSE as well as the values 1 or 0 in your SELECT statement, but under some circumstances a non-zero value other than 1 may not be included in a search for TRUE. Note the following slightly different statements:
WHERE is_boolean IS TRUE        // will include any non-zero value
WHERE is_boolean IS NOT TRUE    // will exclude any value apart from '0'
WHERE is_boolean = TRUE         // will include the value '1' only
WHERE is_boolean != TRUE        // will exclude any value which is not '1'
WHERE is_boolean IS FALSE       // will include the value '0' only
WHERE is_boolean IS NOT FALSE   // will exclude any value which is not '0'
WHERE is_boolean = FALSE        // will include the value '0' only
WHERE is_boolean != FALSE       // will exclude any value which is not '0'
WHERE is_boolean                // same as IS TRUE
WHERE !is_boolean               // same as IS FALSE
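The difference between IS TRUE and = TRUE shows up in any engine that stores booleans as integers. Here is a sketch using Python's sqlite3 module, which (in reasonably recent SQLite versions, like MySQL) accepts the TRUE keyword and treats any non-zero value as true under IS TRUE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (is_boolean INTEGER NOT NULL)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (1,), (2,)])

is_true = conn.execute(
    "SELECT COUNT(*) FROM t WHERE is_boolean IS TRUE").fetchone()[0]
eq_true = conn.execute(
    "SELECT COUNT(*) FROM t WHERE is_boolean = TRUE").fetchone()[0]
print(is_true, eq_true)   # 2 1 - the row holding 2 is truthy but is not "= TRUE"
```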
It is also possible to define a numerical BOOLEAN as follows:
`some_field` INT(11) DEFAULT NULL,
The use of signed integers with a size greater than one is a sure sign of a sloppy mind. The specification TINYINT(1) UNSIGNED makes it more obvious that the range of possible values is severely limited.
Other possibilities are as follows:
`is_boolean` CHAR(1) NOT NULL DEFAULT 'N' COMMENT 'N=No, Y=Yes',
`is_boolean` ENUM('No','Yes') DEFAULT 'No',
Note that with these non-numeric data types it is not possible to test for the values TRUE or FALSE - you must use the actual character strings. Note also that with ENUM columns the valid values are given index numbers which start at 1, which allows an invalid value to be given the index number of 0. Such invalid values will be excluded from any search using either of the following:
WHERE is_boolean = 'No'    // will test for index=1 only
WHERE is_boolean = 'Yes'   // will test for index=2 only
It seems that some novices do not understand how certain data types work before they try using them. Take the following which I encountered recently:
`type` enum('A','M') NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
As anyone who has read the manual will tell you, each entry in the list of permissible values is given an index number, and it is the index number which is stored in the database. The software can then retrieve the value associated with each index number from the ENUM list. It is not usual practice to provide a list of values which are in turn keys to yet another set of values.
The correct way to use an ENUM data type is as follows:
`type` enum('Admin','Member') NOT NULL DEFAULT 'Member',
An alternative method would be as follows:
`type` CHAR(1) NOT NULL DEFAULT 'M' COMMENT 'A for Admin, M for Member',
Trying to mix the two methods is a sure sign of a very confused mind.
This was identified by Brian Cryer in the comp.databases.mysql newsgroup, who used to work with someone who insisted on adding an index for every field that was used as part of a query, as in the following table:
CREATE TABLE Arrivals (
    ArrivalID int AUTO_INCREMENT,
    CountryID int,
    SiteID int,
    StationID int,
    PRIMARY KEY (ArrivalID),
    INDEX I1 (CountryID),
    INDEX I2 (SiteID),
    INDEX I3 (StationID)
);
Most of the queries against this table were of the form:
SELECT * FROM Arrivals WHERE CountryID=X AND SiteID=Y;
What he didn't appreciate is that the database will normally use only one index per table when processing a query, so defining multiple single-column indexes like this - especially ones which would never actually be used - is wasteful. Each index consumes disk space and processor cycles to maintain its contents. What he should have done was define a single compound index, such as:

INDEX I1 (CountryID, SiteID)
He went on to compound his error by simply adding compound indexes to his previous mistakes:
INDEX I1 (CountryID),
INDEX I2 (CountryID, SiteID)
The first of these is totally redundant as it is the leading part of the second index.
This was identified by Herman Viaene in the comp.databases.mysql newsgroup.
While it is permissible to have compound keys which are comprised of more than one column, it is not a good idea to have a column with compound values. By this I mean a value such as AAAA-BBBB-CCCC-NNNN where each of the different parts has a different meaning, such as product type, subtype, et cetera. This used to be quite common in the pre-relational database days when compound keys were not allowed, so as much data as necessary was packed into a single column. This would cause problems whenever one of the substrings became too short for the addition of new codes, and would also make searching on anything other than the leading substring a bit of a headache. With modern databases each of those components would be in its own column, so none of those old problems would exist.
In some applications there may be values which can be calculated from other values, so there is little point in storing all of them in the database. For example, in a sales order processing system you will have order items which have UNIT_PRICE, NUMBER_OF_UNITS and EXTENDED_PRICE. Only two of these values need be stored in the database as the third can be obtained from a simple calculation, such as:
EXTENDED_PRICE = UNIT_PRICE * NUMBER_OF_UNITS
UNIT_PRICE = EXTENDED_PRICE / NUMBER_OF_UNITS
So which values do you store in the database, and which do you calculate as and when necessary? When there is a choice between two possible calculations, one involving a multiplication and another involving a division, you should always choose the multiplication. Why? The simple answer is "accuracy". If the values contain different numbers of decimal places then the result of a multiplication, even if it is rounded afterwards, will always be more accurate than dividing by a value which may already have been rounded. If the UNIT_PRICE and EXTENDED_PRICE both have two decimal places then it might be difficult to prove this point, but suppose the UNIT_PRICE has three decimal places? Look at the following example where the UNIT_PRICE is 4.005 and the NUMBER_OF_UNITS is 7:

4.005 * 7 = 28.035

which can be stored as either 28.04 (rounded up) or 28.03 (rounded down).
What happens when you store the EXTENDED_PRICE and try to calculate the UNIT_PRICE?

28.04 / 7 = 4.005714285714286 (which is rounded UP to 4.006)
28.03 / 7 = 4.004285714285714 (which is rounded DOWN to 4.004)

Neither calculation recovers the original value of 4.005.
This becomes even more critical if you ever do currency conversions using an exchange rate which has 5 or more decimal places. The two possible calculations are:
HOME_CURRENCY_AMOUNT * EXCHANGE_RATE = FOREIGN_CURRENCY_AMOUNT
FOREIGN_CURRENCY_AMOUNT / EXCHANGE_RATE = HOME_CURRENCY_AMOUNT
The difference between the two calculations may appear small to you, but in the world of finance where you are expected to balance your books to the nearest penny on sums which go into millions, such minor differences are not acceptable.
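The arithmetic above can be checked with Python's decimal module (a sketch; the 4.005 price and quantity of 7 are taken from the earlier example):

```python
from decimal import Decimal, ROUND_HALF_UP

unit_price = Decimal("4.005")   # three decimal places
qty = 7

# Store the multiplication result, rounded to two decimal places.
extended = (unit_price * qty).quantize(Decimal("0.01"), ROUND_HALF_UP)
print(extended)    # 28.04

# Trying to recover the unit price by division no longer works:
# the rounding applied to the stored value has destroyed precision.
recovered = (extended / qty).quantize(Decimal("0.001"), ROUND_HALF_UP)
print(recovered)   # 4.006 - not the original 4.005
```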
This is the opposite of the previous tip. A junior designer who heard about the idea of not storing values which could be calculated went further than necessary and decided that in his accounting system he would not bother to store the account balance anywhere; he would simply calculate it by summing the values on the associated records in the transaction table, which stored individual debit and credit amounts. While the number of transactions was small this overhead was negligible, but as the number of accounts and transactions grew, the time taken to perform this calculation became longer and longer, eventually becoming totally unacceptable.
Where a value can be calculated easily by reading a single record then the cost of that calculation is acceptable, but where the number of records you need to read can grow and grow over time then it really should be stored for speed of access, and should only be recalculated as and when necessary.
One thing worse than producing a less-than-efficient design before a single line of code has been written is to then change a table's structure in such a way that it breaks that code which has been written. A typical case is when a new column is added to a table, but instead of being made optional (nullable) it is made required (not nullable) but without a default value. Any existing query which attempts to insert a record into that table will therefore fail. This is even worse in those situations where an application contains hard-coded queries in multiple places as each of those places will have to be identified and changed.
This is because they are different things: a FOREIGN KEY is used in a SELECT query, whereas a FOREIGN KEY CONSTRAINT is used in an INSERT, UPDATE or DELETE query.
It is possible to use a column in a child table as a foreign key which references the primary key in a parent table without having to create a foreign key constraint for that column in the child table. This would appear in a SELECT query as a JOIN, such as in the following example:
SELECT A.id, A.name, B.attribute FROM A LEFT JOIN B ON (B.primary_key = A.foreign_key) WHERE ...
In this example table 'A' contains a column called 'foreign_key' which contains the value for the primary key (in a column called 'primary_key') in table 'B'. It is not necessary for column 'foreign_key' on table 'A' to be defined as a foreign key before it can be used in this way. It is not necessary for the foreign and primary keys to have the same names, but they must have the same datatypes and sizes. It is also possible for both keys to be compound or composite keys, but they must have the same number of columns, and each column in one key must match the datatype and size of the corresponding column in the other key.
Note also that in order to use a foreign key in a SELECT query you must do so explicitly by adding a JOIN clause which specifies both the identity of the joined table and the column names in the foreign key which link to the corresponding column names in the primary key. Even if this information is available in a foreign key constraint, there is no mechanism in the database server to automatically incorporate it into a SELECT query.
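This point can be verified directly (SQLite, hypothetical tables): no FOREIGN KEY CONSTRAINT is declared anywhere, yet the JOIN works purely on matching values, exactly as written in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (primary_key INTEGER PRIMARY KEY, attribute TEXT);
    -- note: no REFERENCES clause on A.foreign_key
    CREATE TABLE A (id INTEGER PRIMARY KEY, name TEXT, foreign_key INTEGER);
    INSERT INTO B VALUES (10, 'widget');
    INSERT INTO A VALUES (1, 'first', 10), (2, 'orphan', 99);
""")

rows = conn.execute("""
    SELECT A.id, A.name, B.attribute
    FROM A LEFT JOIN B ON (B.primary_key = A.foreign_key)
    ORDER BY A.id
""").fetchall()

# The LEFT JOIN returns every row of A; a value with no matching
# parent simply yields NULL, because nothing enforces the relationship.
assert rows == [(1, 'first', 'widget'), (2, 'orphan', None)]
```

The second row shows the flip side: without a constraint nothing stops an orphaned value from being stored in the first place.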
On the other hand, if you have defined a foreign key constraint then the database server will enforce it automatically during every INSERT, UPDATE or DELETE: it will reject any change which would leave a child row pointing at a non-existent parent, and it will carry out whatever referential action (such as CASCADE, SET NULL or RESTRICT) was declared in the constraint.
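A sketch of that automatic behaviour (SQLite, hypothetical tables): with ON DELETE CASCADE declared in the constraint, deleting a parent removes its children without the developer writing any extra query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite requires this per connection
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id) ON DELETE CASCADE
    );
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (100, 1), (101, 1);
""")

# One DELETE on the parent; the server cascades to the children itself.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
assert remaining == 0
```

The same constraint would also reject an attempt to insert a child row whose parent_id does not exist in the parent table.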
Some people have very dubious excuses for NOT using compound keys. Here are some comments from Composite Primary Keys:
there's only one reason i dislike composite keys: you have to type more than one column in the joins.
I tend to side with Andre on this topic. One thing that I will add when using an identity field as the PK is the reduction in columns in child tables. Especially when you have tables that are 5 or 6 levels down, and the parent tables have composite keys, and the parents parents tables have composite keys, etc. Eliminates a decent amount of columns, IMO is easier to read/understand the schema, and less SQL to write in the Joins.
Surrogates are definitely my favourite form of primary key. I would actually argue your definition of a primary key but I see the point that you are trying to get across. I loath composite keys being used loosely in a database because I don't like writing sql that involves remembering to join on multiple columns (who does).
The idea that "saving keystrokes" is more important than readability or good database design simply does not cut the mustard with me. The decision as to what type of key to use, whether it be natural, technical or compound, is not something that can be defined in a simple rule as it is something that requires experience and judgment. While any of the possibilities may actually work, some will offer better performance, or use less space on disk, or be easier to code around. You can use whatever floats your boat, but don't complain if you make the wrong choice and your boat sinks.
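For what those "extra keystrokes" actually amount to, here is a sketch (SQLite, hypothetical order/order-line tables) of a compound primary key and a join on it: the whole cost is naming both columns in the ON clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        customer_id INTEGER NOT NULL,
        order_no    INTEGER NOT NULL,
        order_date  TEXT,
        PRIMARY KEY (customer_id, order_no)       -- compound key
    );
    CREATE TABLE order_line (
        customer_id INTEGER NOT NULL,
        order_no    INTEGER NOT NULL,
        line_no     INTEGER NOT NULL,
        product     TEXT,
        PRIMARY KEY (customer_id, order_no, line_no)
    );
    INSERT INTO orders VALUES (1, 1, '2021-03-13');
    INSERT INTO order_line VALUES (1, 1, 1, 'widget'), (1, 1, 2, 'gadget');
""")

# The join simply names every column in the compound key.
rows = conn.execute("""
    SELECT o.order_date, l.line_no, l.product
    FROM orders o
    JOIN order_line l
      ON (l.customer_id = o.customer_id AND l.order_no = o.order_no)
    ORDER BY l.line_no
""").fetchall()
assert len(rows) == 2
```

As a side benefit, the compound key itself guarantees that a line cannot exist twice within the same order, which a standalone ID column does not.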
I found the following comments in Composite Primary Keys:
If you're coding in any sort of language above the sql layer, then having the same kind of ID (autoincremented integer) with the same name (ID) on all tables makes it possible and easy to abstract a lot of database functionality and reduces code complexity immensely. This, IMHO, is far more valuable than "being prepared for" the eventualities you mention in the article.
Ramon Leon 2007-08-24
You're argument doesn't work because you're ignoring that programs consist of much more than just schema. You approach works great if you enjoy hand writing all your sql, it's lousy when you want a middleware layer to automate CRUD. People who say the single incrementing key is simpler aren't referring to simpler schemas, they're referring to the vastly simpler programs that result from fully automating the mapping of objects into tables.
These people obviously do not have enough experience in writing code to deal effectively and efficiently with database queries. I use a framework which was specifically designed to help in the building of database applications, and it has the ability to automatically generate queries with JOINs. This is how it works:
Every table in the database (and in my ERP application there are over 400) has its own class which is automatically populated with metadata exported from my Data Dictionary, including details of its columns, its primary and unique keys, and its relationships with parent and child tables.
This arrangement allows me to support in my software anything which is allowed by the database, so restrictions such as "the primary key must consist of a single column which must be called ID and which must be an autoincremented integer" simply do not exist. If your middleware has such restrictions then I would throw it away and start again as it was obviously written by someone of dubious ability and has exceeded its sell-by date.
Unlike database servers which normally ignore foreign keys when constructing SELECT queries, the availability of this metadata in my software allows me to produce more advanced results. If a particular entry in the $parent_tables array does not have a value for parent_field then that entry is ignored, otherwise the query will be constructed along the following lines:
SELECT child.*, parent.parent_field_1[, parent.parent_field_2] FROM child LEFT JOIN parent ON (parent.fldparent1=child.fldchild1 AND parent.fldparent2=child.fldchild2) WHERE ...
It's not exactly rocket science, so if I can do it then surely anybody can.
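The mechanism described above can be sketched in a few lines. Note that the metadata layout used here is an assumption modelled on the description in the text, not the framework's actual structure; only entries with a parent_field are included, as stated.

```python
def build_select(child, parent_tables):
    """Generate a SELECT with a LEFT JOIN for each relevant parent table."""
    select_cols = [f"{child}.*"]
    joins = []
    for parent, meta in parent_tables.items():
        if not meta.get("parent_field"):
            continue   # entries without a parent_field are ignored
        select_cols += [f"{parent}.{f}" for f in meta["parent_field"]]
        on = " AND ".join(f"{parent}.{p}={child}.{c}"
                          for p, c in meta["columns"])
        joins.append(f"LEFT JOIN {parent} ON ({on})")
    return f"SELECT {', '.join(select_cols)} FROM {child} " + " ".join(joins)

# Hypothetical metadata for a child table with one relevant parent,
# using the column names from the example query above:
parent_tables = {
    "parent": {
        "parent_field": ["parent_field_1", "parent_field_2"],
        "columns": [("fldparent1", "fldchild1"), ("fldparent2", "fldchild2")],
    }
}
sql = build_select("child", parent_tables)
assert sql == ("SELECT child.*, parent.parent_field_1, parent.parent_field_2 "
               "FROM child LEFT JOIN parent "
               "ON (parent.fldparent1=child.fldchild1 AND parent.fldparent2=child.fldchild2)")
```

Because the generator iterates over whatever column pairs the metadata supplies, a compound key costs it nothing extra: one pair or five, the loop is the same.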
Every decision has consequences, some intended, some unintended. Every design decision has consequences for those who have to implement that design. I have little respect for designers who cannot build what they design as they have a tendency to design things which are either difficult or impossible to build. While none of the mistakes I have identified above makes it impossible to write code which uses the database, each one acts as a speed bump which slows the developer down. As any driver will tell you, too many speed bumps will have a drastic effect on your journey time. While the distance remains the same the time to reach the destination has been increased, so when a development schedule fails to be met because of too many designer-instigated speed bumps, why do the creators of those speed bumps have the audacity to blame the developers?
If I have little respect for designers who make such basic mistakes, I have absolutely zero respect for those designers who, after being told of the consequences, often unintended, of their poor design, have the arrogance to say "My design is perfect. You'll have to code around it!" In my humble opinion such arrogant nincompoops should be forced to spend time working as lavatory attendants so they get to experience what it is like dealing with other people's cr*p instead of forcing other people to deal with theirs. I once worked on a new application where the UI, application and database designers had absolutely no knowledge of how the development language worked, which meant that nothing they designed could be implemented easily. Their combined designs were so bad that it took 10 days to build 2 simple components which I can now implement, using my own framework, in under 10 minutes.
"How can you possibly reduce development times from 10 days to 10 minutes?" I hear you say. The simple answer is that I am still a hands-on developer and not a day-dreaming designer or an architecture astronaut. I design what can be built because I regularly build from my designs. When I spot a speed bump I do what I can to flatten it and reduce its effect. It doesn't matter what part of the project causes the problem - it could be the naming standards, the development standards, the application design, the database design or the UI design - anything which slows down the developer will be put under the microscope, dissected and reassembled without the problem.
This is a technique which I have used on all my projects, and I have regularly made little savings here and little savings there. Sometimes what used to be a laborious piece of coding can be converted into a simple subroutine. Sometimes whole swathes of code can be replaced by several subroutines. Sometimes a collection of subroutines can be merged into a full-blown framework. I don't have to write any SQL statements any more because my framework does it for me, and that is only possible because I kept tweaking my design until it was. My framework can do lots of things automatically for exactly the same reason.
Lots of little savings soon accumulate into big savings.
Good design pays off, bad design costs both in initial development and ongoing maintenance.
13 Mar 2021 - Added "How can you have a Foreign Key without a Foreign Key Constraint?", "Compound keys require too many keystrokes" and "Compound keys make it difficult for SQL generators".
09 Nov 2012 - Updated list of common mistakes to contain "Too many indexes" and "Columns containing compound values" following feedback from the comp.databases.mysql newsgroup.