Thursday, October 29, 2009

A History of Tarballs

I have been maintaining the autoconfigury of PostgreSQL for many years now, and every once in a while I go to ftp://ftp.gnu.org/gnu/autoconf/ to check out a new version of Autoconf. That FTP listing is actually an interesting tale of how tarball creation practices have evolved over the years.

Obviously, .tar.gz has been the standard all along. Some projects have now completely abandoned .tar.gz in favor of .tar.bz2, but those are rare. I think most ship both now. The FTP listing goes back to 1996; the first .tar.bz2 was shipped in 2001.

RPM-based distributions have switched to supporting and then requiring bzip2-compressed tarballs many years ago. Debian might start supporting that with the next release. So if you want to be able to trace your pristine tarballs throughout the popular Linux distributions, shipping both is best.

One thing that was really popular back then but is almost forgotten now is providing patches between versions, like autoconf-2.12-2.13.diff.gz. The Linux kernel still does that. Autoconf stopped doing that in 1999, when it was replaced by xdelta. Anyone remember that? This lasted until 2002 and was briefly revived in 2008. I think shipping xdeltas is also obsolete now except possibly for huge projects.

In 2003, they started signing releases. First with ASCII-armored signatures (.asc), now with binary signatures (.sig). The Linux kernel also does this, except they call the ASCII-armored signatures .sign.

In 2008, we saw the latest invention, LZMA-compressed tarballs (.tar.lzma). They appear to compress better than bzip2 by about as much as bzip2 wins over gzip. But, this one's already obsolete because it was replaced in 2009 by LZMA2, which goes by the file extension .tar.xz. Some "early adopters" such as Debian's packaging tool dpkg are in the process of adding xz support in addition to the short-lived lzma support.

Throughout all this, interestingly, tar hasn't changed a bit. Well, there are various incompatible extended tar formats around, but when this becomes a problem, people tend to revert to GNU tar.

GNU tar, by the way, supports all the above compression formats internally. gzip is -z, bzip2 is -j, lzma is, well, --lzma, and xz is -J.  And Automake supports creating all these different formats for source code distributions.

Friday, October 23, 2009

Attention PL/Proxy Users: Hash Functions Have Changed in PostgreSQL 8.4

Consider the following elementary PL/Proxy example:
CREATE FUNCTION get_user_email(username text)
RETURNS text AS $$

    CLUSTER 'userdb';
    RUN ON hashtext(username);

$$ LANGUAGE plproxy;
The integrity of this setup depends on (among other things) the hash function always giving the same result for the same username. Otherwise your calls go to the wrong partition and you won't find your data again.

Unfortunately, the hashtext() function and other hash functions have changed their implementation between PostgreSQL 8.3 and 8.4. Observe:
8.3=> SELECT hashtext('foobar');
 hashtext
-----------
 504683490
(1 row)

8.4=> SELECT hashtext('foobar');
 hashtext
-----------
 289967942
(1 row)
So when you update your proxy database from 8.3 to 8.4, you will likely make all your data invisible and/or create a big mess.

Solution? Well, in the short run: don't update quite yet. If you're just starting or you have a small database, reload all your data through the proxy instance after upgrading. The best solution for now appears to be forward-porting 8.3's hash function to 8.4 as an add-on module. Eventually, it would probably be best if PL/Proxy itself provided a stable set of hash functions.

Thursday, October 8, 2009

Rethink your text column indexing with PostgreSQL 8.4

The following item in the PostgreSQL 8.4 release notes hasn't gotten much attention:
  • xxx_pattern_ops indexes can now be used for simple equality comparisons, not only for LIKE (Tom) 
But this might significantly change how you approach indexing of text (and varchar and char (brrrh)) columns.

Let's review.  Consider a table like this:
CREATE TABLE persons (
    id int PRIMARY KEY,  -- hi Josh
    name text,
    otherdata ...
);
Since you occasionally want to look up a row by name, like
SELECT * FROM persons WHERE name = 'Smith';
you add an index like this:
CREATE INDEX persons_name_idx ON persons (name);
Then you decide that you also want to do wildcard searches like
SELECT * FROM persons WHERE name LIKE 'Smi%';
(or using POSIX regular expression; doesn't matter for this purpose). After some bemusement you discover and blindly accept that you need the following index to make that work:
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
And so it has become established practice, to some degree, to create two indexes on the interesting text fields: one for "normal" searches and one for pattern searches, with the ensuing hit on write performance.

Now what the mysterious release note item above says is that the pattern search index can now also be used for equality searches. So the index
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
can be used to speed up the query
SELECT * FROM persons WHERE name = 'Smith';
(Try it out with EXPLAIN and enable_seqscan off in 8.3 and 8.4 to see the difference.)

So what do you need the "normal" index (persons_name_idx) for, then? Well, it will help you if you do range queries, like
SELECT * FROM persons WHERE name >= 'Smith' AND name <= 'Taylor';
But really, how often do you do range queries on text fields? Not very often. So here is the new plan. Next time you index a text field, think xxx_pattern_ops by default. It might be what you want more often than not. As a small bonus, I think the pattern_ops operator classes should also be slightly faster than the default ones, because they don't go through the full locale-enabled collation processing. And if you have been thinking two indexes so far, think only one index now. A great performance bonus there.

Here's a bummer: Let's say your SSSKA membership is up for renewal and you decide to do your table like this instead:
CREATE TABLE persons (
    name text PRIMARY KEY,
    otherdata ...
);
The primary key automatically creates an index using the default operator class, but as we have discovered now, we might want to have an index with a different operator class and only that one. The constraints only need to check for equality, so it shouldn't matter which of the operator classes it uses. But there is currently no way to specify an operator class for the index supporting primary key and unique constraints. Something to think about. OK, here is a completely evil way to do this: Edit the pg_opclass system catalog and switch the opcdefault setting around between the operator classes. Experiment at your own risk.

(Note: If you are using the C locale for lc_collate, this does not apply to you. Stick with the default operator class in that case.)