Thursday, June 19, 2008

Schema design and ID fields

I'm sure everyone has created something like this once:
CREATE TABLE a (id int PRIMARY KEY, value text);

CREATE TABLE b (id int PRIMARY KEY, data text, a_id int REFERENCES a (id));
CREATE TABLE c (id int PRIMARY KEY, a_id int REFERENCES a (id), moredata text);

They all have id columns, but they are all different sets of IDs. Moreover, the same set of IDs is sometimes knows as "id", sometimes as "a_id". Some time ago I come to the conclusion that this naming scheme is bad. It makes sense locally within the table, because a.a_id would be redundant. But I think it has advantages to be redundant here anyway.

The first advantage is that you can use this join syntax:
a JOIN b USING (a_id) JOIN c USING (a_id)

If you build denormalized intermediate tables (materialized views) of this type
CREATE TABLE a_b AS SELECT * FROM a, b, c WHERE a.id = b.a_id AND a.id = c.a_id
it will fail because the join will have multiple columns named a_id. Unless using the USING join syntax, it won't know they are actually the same. So you would have to write out the column list instead of the asterisk in the select list.

Also, once your SQL queries become more complicated, say using table aliases and subselects, it is easy to lose track of what foo.id means in a particular context.

For all of these reasons, the general rule arises that a column name should be globally unique within a database schema. This should actually be checkable automatically if you allow duplicates only for columns connected by foreign key constraints.

There is another idea you can play with while developing your database schema. You start by creating (conceptually) one big table containing all the columns that are needed to store your data, which obviously requires globally unique column names. Then you use the decomposition algorithm to normalize this table down to the individual tables that you actually implement. Some might find this preferrable to converting ER diagrams.