Friday, February 5, 2010

My Favorite PostgreSQL 9.0 Feature

My favorite PostgreSQL 9.0 feature does not have a two-letter acronym. It's the new bytea format, available since 8.5alpha1.

At F-Secure Labs, as you might imagine, we store information about a bunch of malware samples. Throughout the computer security industry, file samples, malware or perhaps not, are referred to by a hash value, such as MD5, SHA1, or SHA256. The typical representation of such hash values in most programming environments and also in prose and literature is the hexadecimal format, for example da39a3ee5e6b4b0d3255bfef95601890afd80709. Except when you want to write a test case against PostgreSQL or want to track down a problem, you'll be looking for \3329\243\356^kK\0152U\277\357\225`\030\220\257\330\007\011, also known as \\3329\\243\\356^kK\\0152U\\277\\357\\225`\\030\\220\\257\\330\\007\\011 in some contexts.

Well, that's over now; it will show as \xda39a3ee5e6b4b0d3255bfef95601890afd80709. You will still need to take care of the backslash, but that will surely be resolved when standard_conforming_strings is turned on in version 10.0 ;-), or we implement a new, SQL-standard conforming binary string type without legacy syntax issues.

By the way, the actual origin of this feature idea was a performance problem, reported by Bernd Helmle. The new format is quite a bit faster to encode and decode: In some internal tests, pg_dump of tables with mostly binary data was twice as fast and created a dump file that was half the size with the new format compared to the old format. So hopefully everyone wins.

Another new feature in the same area, by the way, is that PL/Python now supports bytea values sanely, contributed by Caleb Welton.