Tuesday, December 22, 2009

Patience

Occasionally, there are concerns expressed about the adoption rate of Python 3.  Now that PostgreSQL 8.5alpha3 is released with Python 3 support in PL/Python, let's see what the schedule might be until this hits real use.

Python 3.0 was released in December 2008 and was admitted to be somewhat experimental.  At that point, PostgreSQL 8.4 was already in some kind of freeze, so adding a feature as significant as Python 3 support was not feasible.

In fact, we opted to do two significant rounds of fixing/enhancing/refactoring of PL/Python before tackling Python 3 support: fixed byte string (bytea) support and Unicode support.  Both of those benefit Python 2 users, and they made the eventual port to Python 3 quite simple.

PostgreSQL 8.5 might release around May 2010.  Debian squeeze is currently in testing and doesn't even contain Python 2.6 or Python 3.1 yet, due to some technical problems and a bit of infighting.  Debian freezes in March 2010, which means without PostgreSQL 8.5 but hopefully with Python 2.6 and 3.1.  The final release of squeeze should then be later in 2010, which means the earliest that a significant number of Debian users are going to look into moving any of their code at all nearer to Python 3 (via 2.6) is going to be late 2010.  (Other operating system distributions will have other schedules, of course.)

The next Debian release (squeeze+1), which will presumably include PostgreSQL 8.5 or later and a solid Python 3.x environment, will then not be released before January 2012, if we believe the current 18-months-plus-slip cycle of Debian.  So it will be mid-2012 until significant numbers have upgraded Debian and PostgreSQL to the then-current versions.  If you are sticking to a stable and supported operating system environment, this is the earliest time you actually have the option to migrate to the Python 3 variant of PL/Python across the board for your applications.  Of course in practice this is not going to be the first thing you are going to do, so by the time you actually port everything, it might be late 2012 or even 2013.  Also, if you are heavily invested in PL/Python, you are probably not going to upgrade much of your other Python code before PL/Python is ready.

This will then be 3 years after the PL/Python 3 code is written and 4 years after the release of Python 3.0.  And 2 years before Python 2.x is expected to go out of maintenance in 2015.

So, to all users and developers: patience.

Incidentally, I fully expect to still be using IPv4 by then. ;-)

Thursday, October 29, 2009

A History of Tarballs

I have been maintaining the autoconfigury of PostgreSQL for many years now, and every once in a while I go to ftp://ftp.gnu.org/gnu/autoconf/ to check out a new version of Autoconf. That FTP listing is actually an interesting tale of how tarball creation practices have evolved over the years.

Obviously, .tar.gz has been the standard all along. Some projects have now completely abandoned .tar.gz in favor of .tar.bz2, but those are rare. I think most ship both now. The FTP listing goes back to 1996; the first .tar.bz2 was shipped in 2001.

RPM-based distributions switched to supporting and then requiring bzip2-compressed tarballs many years ago. Debian might start supporting that with the next release. So if you want to be able to trace your pristine tarballs throughout the popular Linux distributions, shipping both is best.

One thing that was really popular back then but is almost forgotten now is providing patches between versions, like autoconf-2.12-2.13.diff.gz. The Linux kernel still does that. Autoconf stopped doing that in 1999, when it was replaced by xdelta. Anyone remember that? This lasted until 2002 and was briefly revived in 2008. I think shipping xdeltas is also obsolete now except possibly for huge projects.

In 2003, they started signing releases. First with ASCII-armored signatures (.asc), now with binary signatures (.sig). The Linux kernel also does this, except they call the ASCII-armored signatures .sign.

In 2008, we saw the latest invention, LZMA-compressed tarballs (.tar.lzma). They appear to compress better than bzip2 by about as much as bzip2 wins over gzip. But, this one's already obsolete because it was replaced in 2009 by LZMA2, which goes by the file extension .tar.xz. Some "early adopters" such as Debian's packaging tool dpkg are in the process of adding xz support in addition to the short-lived lzma support.

Throughout all this, interestingly, tar hasn't changed a bit. Well, there are various incompatible extended tar formats around, but when this becomes a problem, people tend to revert to GNU tar.

GNU tar, by the way, supports all the above compression formats internally. gzip is -z, bzip2 is -j, lzma is, well, --lzma, and xz is -J.  And Automake supports creating all these different formats for source code distributions.
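
For reference, producing these formats with GNU tar looks roughly like this (the file names are just examples):
tar -czf foo-1.0.tar.gz foo-1.0/           # gzip
tar -cjf foo-1.0.tar.bz2 foo-1.0/          # bzip2
tar --lzma -cf foo-1.0.tar.lzma foo-1.0/   # lzma
tar -cJf foo-1.0.tar.xz foo-1.0/           # xz
# With an Automake-based package, targets such as "make dist-bzip2" produce
# the corresponding tarballs as part of the distribution build.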

Friday, October 23, 2009

Attention PL/Proxy Users: Hash Functions Have Changed in PostgreSQL 8.4

Consider the following elementary PL/Proxy example:
CREATE FUNCTION get_user_email(username text)
RETURNS text AS $$

    CLUSTER 'userdb';
    RUN ON hashtext(username);

$$ LANGUAGE plproxy;
The integrity of this setup depends on (among other things) the hash function always giving the same result for the same username. Otherwise your calls go to the wrong partition and you won't find your data again.

Unfortunately, the hashtext() function and other hash functions have changed their implementation between PostgreSQL 8.3 and 8.4. Observe:
8.3=> SELECT hashtext('foobar');
 hashtext
-----------
 504683490
(1 row)

8.4=> SELECT hashtext('foobar');
 hashtext
-----------
 289967942
(1 row)
So when you update your proxy database from 8.3 to 8.4, you will likely make all your data invisible and/or create a big mess.

Solution? Well, in the short run: don't update quite yet. If you're just starting or you have a small database, reload all your data through the proxy instance after upgrading. The best solution for now appears to be forward-porting 8.3's hash function to 8.4 as an add-on module. Eventually, it would probably be best if PL/Proxy itself provided a stable set of hash functions.
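
If you want to insulate yourself against such changes in the future, one approach is to route on a hash function whose implementation you control; a sketch (hashtext_compat is a hypothetical function that you would provide yourself, for example as a wrapper around the forward-ported 8.3 code):
-- hashtext_compat is hypothetical; you provide and control its implementation
CREATE FUNCTION get_user_email(username text)
RETURNS text AS $$

    CLUSTER 'userdb';
    RUN ON hashtext_compat(username);

$$ LANGUAGE plproxy;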

Thursday, October 8, 2009

Rethink your text column indexing with PostgreSQL 8.4

The following item in the PostgreSQL 8.4 release notes hasn't gotten much attention:
  • xxx_pattern_ops indexes can now be used for simple equality comparisons, not only for LIKE (Tom) 
But this might significantly change how you approach indexing of text (and varchar and char (brrrh)) columns.

Let's review.  Consider a table like this:
CREATE TABLE persons (
    id int PRIMARY KEY,  -- hi Josh
    name text,
    otherdata ...
);
Since you occasionally want to look up a row by name, like
SELECT * FROM persons WHERE name = 'Smith';
you add an index like this:
CREATE INDEX persons_name_idx ON persons (name);
Then you decide that you also want to do wildcard searches like
SELECT * FROM persons WHERE name LIKE 'Smi%';
(or using a POSIX regular expression; it doesn't matter for this purpose). After some bemusement you discover and blindly accept that you need the following index to make that work:
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
And so it has become established practice, to some degree, to create two indexes on the interesting text fields: one for "normal" searches and one for pattern searches, with the ensuing hit on write performance.

Now what the mysterious release note item above says is that the pattern search index can now also be used for equality searches. So the index
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
can be used to speed up the query
SELECT * FROM persons WHERE name = 'Smith';
(Try it out with EXPLAIN and enable_seqscan off in 8.3 and 8.4 to see the difference.)
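
For instance, this is the kind of thing to run on both versions (a sketch):
SET enable_seqscan = off;  -- push the planner toward an index if at all possible
EXPLAIN SELECT * FROM persons WHERE name = 'Smith';
-- with only persons_name_like_idx in place, 8.4 can use that index here,
-- while 8.3 cannot and falls back to a (penalized) sequential scan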

So what do you need the "normal" index (persons_name_idx) for, then? Well, it will help you if you do range queries, like
SELECT * FROM persons WHERE name >= 'Smith' AND name <= 'Taylor';
But really, how often do you do range queries on text fields? Not very often. So here is the new plan. Next time you index a text field, think xxx_pattern_ops by default. It might be what you want more often than not. As a small bonus, I think the pattern_ops operator classes should also be slightly faster than the default ones, because they don't go through the full locale-enabled collation processing. And if you have been thinking two indexes so far, think only one index now. A great performance bonus there.

Here's a bummer: Let's say your SSSKA membership is up for renewal and you decide to do your table like this instead:
CREATE TABLE persons (
    name text PRIMARY KEY,
    otherdata ...
);
The primary key automatically creates an index using the default operator class, but as we have discovered now, we might want to have an index with a different operator class and only that one. The constraints only need to check for equality, so it shouldn't matter which of the operator classes it uses. But there is currently no way to specify an operator class for the index supporting primary key and unique constraints. Something to think about. OK, here is a completely evil way to do this: Edit the pg_opclass system catalog and switch the opcdefault setting around between the operator classes. Experiment at your own risk.
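
Concretely, the switch might look roughly like this (an untested sketch; again, at your own risk):
-- make text_pattern_ops the default btree operator class for text (evil!)
UPDATE pg_catalog.pg_opclass
   SET opcdefault = (opcname = 'text_pattern_ops')
 WHERE opcmethod = (SELECT oid FROM pg_catalog.pg_am WHERE amname = 'btree')
   AND opcname IN ('text_ops', 'text_pattern_ops');
-- (and remember to switch it back afterwards)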

(Note: If you are using the C locale for lc_collate, this does not apply to you. Stick with the default operator class in that case.)

Tuesday, September 29, 2009

How to make LaTeX Beamer prettier

Having held a number of presentations at free software conferences and training events over the years, I have played with a bunch of presentation tools and keep coming back to LaTeX Beamer as the least of many evils, just as many others who go to these sorts of events appear to have done. Like most Beamer users, it appears, I had stuck with the default theme: Warsaw, blue. It's almost a trademark of Beamer by now.

Occasionally, however, I get somewhat jealous of the look created by Powerpoint, Impress, Keynote, or some other more evil tool. So finally I have collected some tips on how to make Beamer presentations less ordinary, but not silly, just better looking. Actually, the Beamer manual contains much of this information as well, so you can check that for the details, but the Beamer manual seems more targeted toward writing a mathematics lecture, not a technical presentation for an IT conference. So here it goes:
  • Create your own color theme. The standard blue is boring. Your company or organization probably has brand colors; use those. Normally, they are designed to look good together. (But if your primary color is supposed to be some shade of red, be careful. Sometimes, red slides look quite bad.) If you really don't have a color scheme, google for "color scheme creator" and make one.
  • When you implement your color theme, be sure to also set the colors for alerts and blocks, and anything else you might use. These are good places to apply secondary colors that your company's brand guidelines might supply.
  • Use a different style theme. Unless you are writing a semester-length lecture series (which you are probably not, if you are reading this), avoid all the themes that have always-on table of contents, progress bar, and other distracting stuff. Most of the themes don't look very good, in my opinion, but that is partially because of the funny colors. So the advice is, change the colors and theme in unison. It's not that hard; it took me a day to figure out by trial and error. Once you are done with this, save this as your or your organization's presentation template for the future.
  • Use \beamertemplatenavigationsymbolsempty to remove the navigation symbols from the PDF. No one needs them, they look weird, and they kind of reveal that you are using Beamer. Put that command in your theme file.
  • Change the font. Well, if you are going to a conference full of Windows users and want to show off your geek tool, leave the font. On the other hand, if you want your presentation to look more like Windows, that is, PowerPoint, use the Helvetica font. There are also other good fonts available that are neither Linux nor Windows biased. Also change the fonts for the verbatim environments. The same really goes for anything produced with LaTeX. Change the fonts. The defaults are from the 80s.
  • I strongly recommend the "upquote" package, which makes sure that an apostrophe in a verbatim environment (where you put your code samples) is upright like the ASCII apostrophe, not curly like a quotation mark. Besides looking better and more correct, this has the very practical benefit that copying and pasting code out of your PDF slides actually works correctly. You just have to include that package and everything works.
  • I advise against the "listings" package, especially for SQL code. It does not know nearly all the key words, it uses funny fonts, and looks pretty ugly. If someone could fix that, I would be mildly interested. For other languages, for example Python, the results are much better, if you change the fonts and colors. Results may vary. But for SQL, the verbatim environment and a modern typewriter font work much better.
  • For diagrams, schemas, charts, and graphics of that sort, I recommend the "tikz" package. This was my key discovery of late. In the past, creating a flow chart or a deployment diagram meant using something like Dia or OpenOffice.org to construct the image (a painful experience in itself), then exporting that file in various formats until you find one that LaTeX can accept without butchering the image, and then scaling the image to fit on the page. And then if it doesn't look good or you want to change something, you repeat the cycle. With tikz, you issue drawing commands in TeX. Sounds scary, is scary, but it's not that hard. With the excellent documentation and tutorial, you can get the hang of it in a few days. This approach has awesome advantages: The graphics use the same fonts and colors as the rest of your document (the fonts and colors you customized, remember?). When you change the font family or size or a color, your graphics follow suit! All the graphics and fonts scale without problems. And everything is in one file; you don't have to worry about exporting and importing anymore. And you can use version control and diff and make and so on. And if the graphics have a repetitive nature and you are slightly daring, you can even program your graphics in TeX, with loops and automatic layout and all that.
That's it. Compare my SQL/MED presentations before and after I applied these tips. It's actually not so difficult once you figure out the right magic, as is typically the case with TeX things.
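
For illustration, these tips amount to a preamble roughly like the following (the color values and font choices are only placeholders):
\documentclass{beamer}
\usetheme{default}
\usepackage[scaled=0.9]{helvet}        % a Helvetica-like sans serif
\usepackage[scaled=0.85]{beramono}     % a more modern typewriter font
\usepackage{upquote}                   % upright apostrophes in verbatim
\definecolor{brand}{RGB}{0,90,60}      % placeholder for your brand color
\setbeamercolor{structure}{fg=brand}
\setbeamercolor{alerted text}{fg=brand!70!black}
\setbeamercolor{block title}{bg=brand!15!white,fg=brand}
\beamertemplatenavigationsymbolsempty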

Tuesday, September 22, 2009

How to submit a patch by email

Submitting a patch by email isn't hard, but doing it well requires some attention.
  • Create the patch from the top level of the source tree. That means, when you are creating the patch, you should be in the directory that has configure, or in the directory above it, or at least you should make your patch look as if you were. (Git does that for you automatically.) In other words, the patch should apply with -p0 or -p1. Even if you are submitting a patch to PL/Perl, do it from the top level. Consistency is nice.
  • Add all new files into the patch. If you are using anonymous CVS, you are kind of out of luck with that. Switch to Git, or if you are daring, CVSup. Or if using plain diff, add the -N option, for example diff -c -N -r. Please don't send new files separately alongside the patch. They should be in the patch.
  • I personally don't subscribe to the dogmatic insistence on the diff -c format. Sometimes -c works better, sometimes -u. Judge for yourself. Make the patch readable.
  • Don't compress the patch. This might be controversial, but I think unless your patch is huge, compression adds annoyance and saves little. The mail reader at the other end probably won't be able to open the attachment directly, sending the reader through intermediate programs or on a side trip to the console. Replying with the patch inline to comment on it probably won't work either. And the mailing list web archive isn't going to do anything useful with that kind of attachment. (It is sometimes submitted that compressing an attachment prevents mangling by the mail software. That is true, but if that is a problem, you should consider getting better mail software.)
  • If you have to compress, stick with gzip. Some mail and web software can process that usefully. If you go with bzip2 or something else, those chances decrease drastically.
  • Whatever you do, there is almost never a reason to wrap a patch into a tar archive. If you must compress it, gzip is enough. If you have to send more than one file, create multiple attachments.
  • I'm not sure if anyone cares about diffstats. I don't. Patches are not judged by how much code they add or remove or which places they touch.
  • Make sure you send your patch with a proper MIME Content-Type header. You might have to look a little harder into your mail client configuration to change this. It should be text/x-diff or text/x-patch or perhaps text/plain, although I think the latter is technically incorrect. But it should be text/something, so people can open the attachment with a text viewer easily. If the attachment has a weird MIME type, you force your readers to save the attachment to disk, then open it from there, which is just annoying. Also, you'll confuse the mailing list web archive in a similar way. If you compress your patch, make extra sure that the MIME type is correct. Don't send a gzip compressed file claiming it is a tar archive.
  • Give your patch a descriptive file name. Not "patch", but perhaps "hot-standby.patch". Make sure your mail client includes that file name into the description of the attachment (in Content-Disposition), so your readers can save the patch under the same name. Don't use absolute file names for this ("/tmp/mypatch").
  • If you are using Git, you may be tempted to rebase your patch into tiny pieces and submit those as N separate email messages. Don't do that. There is nothing wrong with separating a patch into smaller patches. But the standard for committing is not the smallest possible change that doesn't break anything, it's the smallest possible change that we would want to release if you didn't finish the next one on time.
Those are just the issues I noticed browsing through the current commit fest. Please take a moment to check your patch submission practices and your email client. A good way to check is trying to open your own patch from your own email client, and trying to view the patch in the mailing list web archive.
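
For illustration, either of the following produces a patch of the kind described above (the file and directory names are just examples):
# with Git, from the top of the source tree (applies with -p1):
git diff > hot-standby.patch
# with plain diff against a pristine tree, new files included:
diff -c -N -r postgresql.orig postgresql > hot-standby.patch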

Monday, September 21, 2009

How to set xterm titles in psql

As most of you might know, you can customize the prompts in psql, the PostgreSQL interactive terminal. If you didn't know, you can read up on it in the documentation. You can also add color to your prompt using VT100 control sequences; an example of that can be found in said documentation. I have my shell and psql prompts in different colors so I can tell quickly when I'm logged in to what. It spares me the embarrassment of typing "ls" into psql.

As it turns out, you can use that same mechanism to customize the xterm titles from within psql. It might not be completely clear why that is necessary. After all, if you have customized the xterm titles in your shell, say to show the name of the command currently running, then it would already show something like "psql somedb" when you are using psql. The trick, or perhaps danger, with that is that if you use commands like \c or SET SESSION AUTHORIZATION in psql, then your connection parameters change, but the shell doesn't update the xterm title. This had me confused more than once now.

The control sequence to change the xterm title is ESC]0;stringBEL, where ESC is \033 and BEL is \007. In psql prompt language, this becomes %[%033]0;string%007%]. Add that somewhere to your prompt (doesn't really matter where, as it won't be printed). Say you are using the default prompt (boring!) and want to show "psql dbname" in the xterm window title when logged in to PostgreSQL, then put this into your ~/.psqlrc:
\set PROMPT1 '%[%033]0;psql %/%007%]%/%R%# '
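
And if you also like colored prompts, as mentioned above, the color codes can go into the same setting; a sketch (the colors and layout are a matter of taste):
\set PROMPT1 '%[%033]0;psql %/%007%]%[%033[1;32m%]%n@%/%[%033[0m%]%R%# '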

Note: If you don't have your shell set up to update the xterm title, then the title set by psql will remain after you quit psql. That might be even more confusing, so don't try this if you don't have your shell set up for this.

If you have other fun and/or useful ideas for the psql prompt, please share them in the comments.

Sunday, August 30, 2009

PL/XSLT

Let me introduce the procedural language that you didn't yet know you needed: PL/XSLT.

PL/XSLT isn't actually that new. I faintly recall that I wrote it around the time of the Anniversary Summit in 2006, but never published it. Today I found the source code again, cleaned it up a bit, and put it up on GitHub: http://github.com/petere/plxslt/

Here is roughly how this would look:
CREATE FUNCTION foo(xml) RETURNS xml AS $$
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
...
</xsl:stylesheet>
$$ LANGUAGE xslt;

There are some conditions on the function definition:

  • The first parameter must be of type xml. It will receive the input document. Every function must have at least this one parameter.

  • The following function parameters will be supplied to the stylesheet as XSL parameters.

  • The return type must match the output method specified by the stylesheet. If it is xml, then the return type must be xml; if it is text or html, the return type must be text or varchar.

  • Triggers are not supported.


In the source code, I include an example function that converts the output of the table_to_xml etc. functions to HTML.
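
A call might then look like this (hypothetical; persons_to_html would be a function you define in this language, along the lines of that example):
SELECT persons_to_html(table_to_xml('persons', true, false, ''));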

Of course, the question will arise whether this is sensible. I guess it is about as sensible as considering XSLT a programming language. Judge for yourself. :-)

In the words of GitHub: fork the project and add to it.

Friday, August 28, 2009

Why I Dumped KDE

I had been a KDE fanboy for about 10 years. I remember compiling version 1.0 from source myself. But now I had to put an end to it. Let this be a lesson to us all. If you produce software that is
  1. unstable
  2. unintuitive and weird
  3. lacking important functionality
then your users will be upset.

As a user of free software and Debian testing as my primary desktop operating system, I have a degree of tolerance for unstable software. In fact, two of the biggest culprits in the past, KMail and Konqueror, have become much more stable in KDE 4 compared to KDE 3. But Plasma kept crashing or running away with resources, and there were frequent rendering problems that caused windows and icons to disappear, causing me to restart my sessions every so often. I don't need that. And I didn't even dare to stress the enhanced desktop effects (compositing features) beyond a few trials. In the meantime, I, and lots of other users who dared to adopt KDE 4.0 or 4.1, had to spend hours reporting interface bugs and regressions that were all probably reported once before but got reintroduced accidentally during the big rewrite.

The second problem was that KDE had set out to reinvent the desktop, and ended up giving users a desktop environment that they did not understand and that they could no longer use in the way they were accustomed to. The new theme and widget style were hard to read and fairly hard to use without the enhanced desktop effects also on. The panels wouldn't stay where I wanted them, and the things on the panels wouldn't either. And why, after we have finally gotten to the point that all major free desktop environments can share the system tray, is everything in KDE now a plasma widget that is incompatible with everything else? The widgets, the menus, everything just didn't feel right. The interface of Kopete, the instant messaging client, had me constantly guessing who I was talking to and why. KMail had some good new ideas, but I had to spend hours disabling most new features so I could continue reading my mail. And the network manager applet, still in Debian experimental admittedly, was so weird to use that I'm amazed it ever connected to anything at all.

Which brings me to the third point, the one that finally did it for me. They wrote a new desktop environment, but forgot to include the functionality that people actually need to use. No stable network management. No printer setup. No bluetooth support. No Debian update manager support. The entire groupware suite can still hardly synchronize with anything. Not my phone, not Google, not MS Exchange. Instead, there is now a MySQL server running on my desktop with the promise that it will be able to store all my contacts at some point. Or something. And Konqueror, now demoted to merely a web browser, has, as far as I can tell, not been developed at all since KDE 3.5. It has serious rendering problems with many currently popular web sites and has none of the extensions that Firefox offers. You can probably synchronize the bookmarks with Akonadi someday. Or something. So, in practical terms, if you subtract all that, you are basically left with nothing, and might as well not use KDE at all.

So what now? I had good experiences with LXDE before, so I went with that. It has worked out well so far: visually appealing, lightweight, yet functional, and you can smoothly mix in KDE and GNOME applications. Well, that's LXDE plus OpenBox, to spread the credit fairly. I got Iceweasel configured to do most of the things that I once appreciated Konqueror for, and with the right extensions it's much better. Evolution isn't as configurable as KMail, but at least it can connect my calendar and addressbook to the external sources that I need. Pidgin gives me more protocol options and a saner interface for instant messaging. I hung on to the KDE konsole application, which has nice enhancements in KDE 4. Also, Amarok, possibly another one of those rewrites/reinventions gone wrong, seems to be without usable alternative at the moment. (Kaffeine has not been ported to KDE 4, and all the Gtk/GNOME-type applications I tried didn't do what I needed them to do.)

The Linux desktop, for lack of a better word, is becoming irrelevant. If you are a hacker, all you really need is a browser and a terminal. And maybe Emacs. If you are a normal user, all you need is a browser and an office suite. And maybe Skype. There is not much room for differentiation left there.

Sunday, August 16, 2009

Adding Color to the Console: Round 2: Highlight vs. Pygments vs. Source-highlight

A month ago I compared the Pygments and Source-highlight packages for use as a syntax highlighting tool. A commenter pointed me toward another package called Highlight. So let's compare those three, based on the same criteria as last time:

                  Pygments          Source-highlight   Highlight
Debian package    python-pygments   source-highlight   highlight
Dependencies      Python            C++, Boost         C++
License           BSD               GPL 2+             GPL 3+
Input Languages   136               30                 143
Output Languages  ~11               ~6                 ~9
Styles            16                5                  40
Performance test  6 min             64 sec             67 sec

So for interactive use, it looks like a pretty close call between Source-highlight and Highlight, unless you need one of the additional input formats supplied by Highlight. It appears that Highlight supports SGML files out of the box, unlike the other two, which could make it interesting for me.

Wednesday, July 29, 2009

How to find all tables without primary key

Getting a list of all tables that don't have a primary key is occasionally useful when you need to update a legacy schema or repair someone else's database "design". There are a few recipes out there that peek into the PostgreSQL system catalogs, but here is a way that uses the information schema:
SELECT table_catalog, table_schema, table_name
FROM information_schema.tables
WHERE (table_catalog, table_schema, table_name) NOT IN
      (SELECT table_catalog, table_schema, table_name
       FROM information_schema.table_constraints
       WHERE constraint_type = 'PRIMARY KEY')
  AND table_schema NOT IN ('information_schema', 'pg_catalog');
The last line is useful because many predefined tables don't have primary keys.

Since you're using the standardized information schema, this query should be portable to other SQL database systems. The identical query works on MySQL, for example.

Monday, July 27, 2009

The UserVoice Votes Have Materialized

A few weeks ago, I started a PostgreSQL feedback forum at UserVoice. The votes have stabilized now, so it's time to take a first look. These are the five most requested features:
  1. Hot standby
  2. Materialized views
  3. Synchronous replication
  4. Parallel queries
  5. Better administration and monitoring tools
1, 3, and 5 should not be surprises; we already saw them among Josh's development priorities. Parallel queries are also mentioned occasionally. But it's interesting to see materialized views so high.

I remember many years ago, materialized views were a somewhat hot topic, but things have cooled down significantly now. In a broader sense, materialized views are query caches, and I did hear something about that at PGCon. In the meantime, Jonathan Gardner's article Materialized Views in PostgreSQL appears to remain the most frequently cited treatment on the subject. (The original link is down, unfortunately.) The PGCon 2008 tutorial Materialized Views that Really Work by Dan Chak is also worth looking at.

So, who's up for implementing built-in materialized views/query caching?

Also, if you haven't done so, submit your suggestions and votes at the UserVoice forum.

Saturday, July 18, 2009

How can I get the compiler to check my SQL queries?

Over at Coding Horror, I found this piece of code:
var posts =
    (from p in DB.Posts
     where
         p.PostTypeId == PostTypeId.Question &&
         p.DeletionDate == null &&
         p.Score >= minscore
     orderby p.LastActivityDate descending
     select p).Take(maxposts);
This is C# code with a piece of infrastructure called LINQ to SQL. And I learn:
The big hook here is that this is code the compiler actually understands. You get code completion, compiler errors if you rename a database field or mistype the syntax, and so forth. Perhaps best of all, you get an honest to goodness post object as output!
Is there any open-source stack, possibly involving PostgreSQL, that can do that? And I don't mean ECPG, which can check the SQL syntax, but does not verify the existence and types of tables, columns, etc. at compile time.

Any ideas?

Thursday, July 16, 2009

Adding Color to the Console: Pygments vs. Source-highlight

The other day I wrote about code syntax highlighting with less and source-highlight, and someone pointed out the Pygments package, which can do the same thing. This called for

the great source-highlight vs. pygments face-off

Installation

Source-highlight is in C++ and needs the Boost regex library. About 3.5 MB together. Pygments is in Python and has no dependencies beyond that. About 2 MB installed. No problem on either side.

In Debian, the packages are source-highlight and python-pygments. Note that the Pygments command-line tool is called pygmentize.

Source-highlight is licensed under the GPL, Pygments under a BSD license.

Getting Started

pygmentize file.c
writes a colored version of file.c to the standard output. Nice.
source-highlight file.c
writes a colored version of file.c to file.c.html. As I had written before, the correct invocation for this purpose is
source-highlight -fesc -oSTDOUT file.c
That makes pygmentize slightly easier to use, I suppose.

Supported Languages

Source-highlight supports 30 languages, Pygments supports 136.

Source-highlight can produce output for DocBook, ANSI console, (X)HTML, Javadoc, LaTeX, Texinfo. Pygments can produce output for HTML, LaTeX, RTF, SVG, and several image formats.

Source-highlight supports styles, but only ships a few. Pygments supports styles and filters, and ships over a dozen styles.

So more options with Pygments here.

Also note that Pygments is a Python library that can be used, say, in web applications for syntax highlighting. This is used in Review Board, for example. Source-highlight is just a command-line tool, but it could of course also be invoked by other packages. Horde uses this, for instance.

Speed

To process all the C files in the PostgreSQL source tree (709271 lines), writing the ANSI console colored version of each file.c to file.c.txt:
source-highlight: 25 seconds
pygmentize: 5 minutes
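
For reference, the test loop was of roughly this shape (a sketch; the exact invocation is an assumption):
find . -name '*.c' | while read f; do
    source-highlight -fesc -oSTDOUT "$f" > "$f.txt"
    # or: pygmentize -f terminal -o "$f.txt" "$f"
done
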
So for interactive use with less, source-highlight is probably still the better option.

Robustness

pygmentize gave me a few errors of this sort during the processing of the PostgreSQL source tree:
*** Error while highlighting:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 83: ordinal not in range(128)
(file "/usr/lib/python2.5/site-packages/Pygments-1.0-py2.5.egg/pygments/formatters/terminal.py", line 93, in format)

*** Error while highlighting:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 196-198: invalid data
(file "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode)
That probably shouldn't happen.

source-highlight gave no spurious errors in my limited testing.

Miscellaneous

Source-highlight can highlight its own configuration files, which are in a custom language, and Pygments' configuration files, which are in Python. Conversely, Pygments can of course highlight its own configuration files, but doesn't know what to do with those of Source-highlight.

Summary

I will stick with Source-highlight for interactive use, but Pygments is a great alternative when you need more formatting options or want to integrate the package as a library.

Wednesday, July 15, 2009

Solid-State Drive Benchmarks With Latency Measurements

After my previous round of SSD benchmarks, I was asked to repeat the tests with a newer bonnie version to get some latency measurements. Here are the results:
Version      1.96   ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
hdd 64G 13939 2 10973 2 56604 5 42.0 1
Latency 20356ms 3310ms 60419us 593ms
hdd 64G 13957 2 10933 2 56628 5 42.7 1
Latency 4956ms 2243ms 66937us 523ms
hdd 64G 13938 2 10958 2 56613 5 42.7 1
Latency 5259ms 2577ms 94416us 609ms
hdd 64G 13906 2 10976 2 56602 5 42.5 2
Latency 20689ms 2510ms 56957us 593ms
hdd 64G 13914 2 10990 2 56635 5 42.2 2
Latency 5266ms 2357ms 94285us 518ms
ssd 48G 106763 19 63722 12 254350 21 1825 72
Latency 15396ms 421ms 53517us 7830us
ssd 48G 110696 19 63812 10 254238 21 1971 69
Latency 12050ms 456ms 10615us 1334us
ssd 48G 107193 19 64260 11 254179 21 1823 79
Latency 14950ms 789ms 10542us 8046us
ssd 48G 108723 19 63380 11 254226 21 1801 74
Latency 14118ms 1720ms 40106us 1477us
ssd 48G 108848 19 63363 9 254035 21 1960 75
Latency 15320ms 1717ms 10895us 1387us
hdraid 64G 280086 58 115494 21 450230 40 552.0 49
Latency 1685ms 1325ms 619ms 33518us
hdraid 64G 283360 55 115566 20 467900 45 589.1 42
Latency 1738ms 1109ms 298ms 33042us
hdraid 64G 285003 56 115642 20 460394 43 571.2 37
Latency 1890ms 1272ms 759ms 46262us
hdraid 64G 286838 56 115059 20 453657 43 570.1 43
Latency 1808ms 1065ms 513ms 65779us
hdraid 64G 291125 57 116871 21 445977 43 589.4 40
Latency 1081ms 1057ms 577ms 97685us
I have posted an HTML version of this table at http://wiki.postgresql.org/wiki/User:Petere/SSD_benchmarks, for easier viewing.

Legend: hdd = single hard disk, ssd = Intel X25-E, hdraid = 8 hard disk RAID 10 array

The random seek numbers aren't quite the same as in the last test. I have at times noticed deviations of up to 50% over many measurements.

So the SSD beats everything else in latency, but not by much in the sequential tests. It's difficult to say, however, what exactly these numbers might mean in practice.

A side observation is the 75% or so CPU utilization during the random seek test on the SSD. With multiple disks, you might actually end up exhausting the CPUs before the disks!?!

Saturday, July 11, 2009

Adding Color to the Console: Code Syntax Highlighting with less and source-highlight

Syntax highlighting is a standard feature of every nontrivial editor, but if you just want to look at a file or the output of a command, then you will often only stumble upon niche solutions such as colordiff, or you end up using vim as your file viewer. But one very flexible program that apparently doesn't get much attention is source-highlight, available from the GNU project or as a Debian package. Besides being able to syntax-highlight all the usual programming and markup languages, it can also produce various output formats such as HTML, LaTeX, and of course plain text.

The standalone usage is a bit bizarre, because it tries to produce HTML by default and write the output to a file named infile.html. To produce colored plain text on standard output, use something like
source-highlight -fesc -oSTDOUT FILENAME

To get the most out of this, integrate it with the less pager program, so all the files you look at are automatically colored. Create a file .lessfilter in your home directory with the following contents:
#!/bin/sh

source-highlight -fesc -oSTDOUT "$1" 2>/dev/null
and make this file executable. Be sure to redirect the error output away as shown, so that you don't get annoying messages when source-highlight complains in case it doesn't know the format of the file. Then put the following in your shell startup file (.bashrc or .zshrc):
export LESS="-R"
eval "$(lesspipe)"
The first line makes sure that less will show colors, and the second line sets things up so that the .lessfilter file is invoked.

That's it. It recognizes the file type by the file name extension, so try anything like
less somefile.c
less somefile.py
less somefile.patch
to see the colors. It won't work when less is at the end of a pipe; I have some ideas on how to work around that, which I will post here later.

Wednesday, July 8, 2009

I Just Have One Question

Will the Google Chrome OS include Mono in the default install?

Schema Search Paths Considered Pain in the Butt


OK, I've had it. Who ever thought that a search path was a good idea anyway? I'm talking about the schema search path in PostgreSQL in particular, but any other ones are just as bad. If you review the last 48 hours in the PostgreSQL commit log, you can see what kind of nonsense this creates, and what amount of resources it drains (with apologies to my fellow developers who had to clean up my mistakes).

The Schema Search Path

What is the schema search path? One way to describe it is that it allows you to specify which schemas' objects can be addressed without schema qualification, and in which order those schemas are searched when a name is resolved. So if you write
CREATE TABLE schema1.foo ( ... );
CREATE TABLE schema2.bar ( ... );
SET search_path TO schema1, schema2;
then you can address these tables as just
SELECT ... FROM foo, bar ...
instead of
SELECT ... FROM schema1.foo, schema2.bar ...
Another way to describe it is that it allows you to set hidden traps that mysteriously make your perfectly good SQL behave in completely unpredictable ways. So a perfectly harmless query like
SELECT * FROM pg_class WHERE relname ~ 'stuff$';
can be changed to do pretty much anything, if the user creates his own pg_class and puts it first into the search path, before the system schema pg_catalog.

The way to deal with this, to the extent that it is dealt with at all, is to either set the schema search path everywhere before accessing any object (for some value of everywhere, such as every script, every session, every function) or to explicitly schema-qualify every object reference. The problem with the former is that it negates the point of allowing the user to set the search path in the first place. What you'd need to do instead is to prepend your known search path to what was set already, which is a pain to program in SQL, which is why probably hardly anyone does it that way. The problem with the latter is that it makes schemas pointless, because if you have to qualify everything, you might as well make the schema part of the name ("pg_catalog_pg_class") and forget about the extra indirection.
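
To illustrate the former point: prepending your own schema to whatever search path is already in effect amounts to something like this (a sketch that ignores quoting issues):
SELECT set_config('search_path',
                  'myschema, ' || current_setting('search_path'),
                  false);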

Qualify Everything!

The way we usually deal with this in PostgreSQL core code, for example in the psql describe code, is to add pg_catalog qualifications everywhere; see this last commit for instance. In my small example above, this would become
SELECT * FROM pg_catalog.pg_class WHERE relname ~ 'stuff$';
But this, and the above commit and all of psql is still wrong, because the actually correct way to write this is
SELECT * FROM pg_catalog.pg_class WHERE relname OPERATOR(pg_catalog.~) 'stuff$';
because PostgreSQL of course allows user-defined operators, and those are subject to search path rules like anything else. Proof:
CREATE OPERATOR = ( PROCEDURE = charne, LEFTARG = "char", RIGHTARG = "char" );
SET search_path = public, pg_catalog;
\diS
This command is supposed to show all indexes, but now it will show everything that is not an index. Try this yourself; it's a great way to mess with your colleagues' heads. :-) Anyway, surely no one wants to schema-qualify all operators, which is why attempting to qualify everything is probably a hopeless proposition.

Unix Shells

Let's look for some analogies to this dreaded technology. Unix shells have had search paths for decades. I don't have a whole lot of insight into their history, or what they were originally intended for, but the way I see it now, from the point of view of an operating system development contributor myself, they are just as much a pain in the butt in their full generality. Because users can set hidden traps that can make any old harmless command do anything, various programming and security pains persist. The usual answers are again: set the path explicitly in every shell script, or add an explicit path to every command. The latter is obviously rarely done, and would by the way possibly fall into the same operator trap that I mentioned above. Who knew that [ (as in if [ "$foo" = bar ]) might actually be a separate executable:
$ ls -l '/usr/bin/['
-rwxr-xr-x 1 root root 38608 2009-06-05 03:17 /usr/bin/[

Many authors of shell scripts probably know that setting an explicit path at the top of the script is probably a good idea. Or is it? On this Debian system, I have 242 shell scripts in my /usr/bin/, 181 of which don't appear to set a path. 14 out of 28 scripts in /etc/init.d/ don't set a path. And out of the 14 that do, there are 12 different variants of the path that they actually choose to set. Which makes any actually sensible use of the path impossible. One such use might be to put a locally modified version of a common program into /usr/local/bin/. Now some scripts will use it, some will not.

Linux distributions have been trying to move away from relying on this path business anyway. The Filesystem Hierarchy Standard only allows for a few directories for executable programs. Everything else is supposed to be organized by the package manager, and in practice perhaps by the occasional symlink. Solaris has a much greater mess of bin directories, such as /usr/gnu/bin, /usr/xpg4/bin, but even there I sense that they don't want to keep going in that direction.

Whither PostgreSQL?

In PostgreSQL, you might get the worst of all worlds:
  • Explicitly qualifying everything is completely unwieldy, and perhaps impossible if you can't modify the SQL directly.

  • Programmatically modifying the search path in a correct way is pretty complicated.

  • There is no standard for what goes into what schema.

  • There is talk of adding more "standard schemas", potentially breaking all those applications that managed to get one of the above points right.

  • Using the search path to override a system object by your own implementation doesn't work, at least in half of the applications that choose to do the supposedly right thing and alter the search path to their liking.

  • Security definer functions (analogous to setuid scripts) don't have a safe default for the search path. You need to make sure you set a safe path in every one of them yourself (see also CVE-2007-2138). One wonders how many of those out there are broken in this regard. At least setuid shell scripts are usually not possible.
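
A minimal sketch of a security definer function with a pinned search path (the body is just a placeholder):
CREATE FUNCTION get_secret(user_id int) RETURNS text AS $$
    SELECT 'placeholder'::text;  -- placeholder body
$$ LANGUAGE sql
  SECURITY DEFINER
  SET search_path = pg_catalog;
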
What might be a better solution? More modern programming languages such as Java or Python with their "import" mechanisms, where you explicitly choose what you want in what namespace, appear to have less problems of the sort I described. Of course there is still a CLASSPATH or a PYTHONPATH in the background, but that's actually just a workaround because the underlying operating system relies on search paths and doesn't have an import mechanism itself.

I think the straightforward solution at the moment is to ignore the whole schema issue as much as you can and don't let random people issue random SQL statements. Probably not completely safe, but at least it doesn't tangle you up in a giant mess.

If someone wants to defend the current system or has proposals for fixing it, please leave comments.

(picture by david.nikonvscanon CC-BY)

Tuesday, July 7, 2009

StackOverflow Overflows My Stack


StackOverflow is a pretty neat idea for a web site. It's one of those forums where you ask a question and others answer, and this one is particularly for questions about programming. They really put a lot of thought into it to make this a useful resource for programmers. I recommend watching the presentation that creator Joel Spolsky gave at Google about this, just to learn about the sort of thoughts that went into creating a modern, useful web site.

Except the whole thing is pretty much completely useless to me. Here is why. In order to contribute to StackOverflow, you answer questions that others have posted, and you can vote on answers. There is a slightly sophisticated system behind this that will ultimately establish one or two authoritative answers on a particular question, and then everyone is happy. So the first approach to contributing may be finding unanswered questions. That's easy, it's linked right from the home page: http://stackoverflow.com/unanswered. Now in order to find questions that matter to you, or that you might have some knowledge about, all the questions are tagged. Tags might be javascript, sql, vb.net, and so on. So someone like me might go to http://stackoverflow.com/questions/tagged/postgresql. But this will include all questions ever asked about PostgreSQL. To actually hang out and contribute on occasion, I would like to have all unanswered questions for a tag or set of tags, right? No problem, go to http://stackoverflow.com/unanswered/tagged/postgresql; this link has the tool tip "show unanswered questions tagged 'postgresql'". Except this doesn't work; it shows all unanswered questions, ignoring the tag.

I was confused about this for a while, and couldn't really work with the site because of it. But the design of the site is so clear that it was almost impossible that I misunderstood the intents of the various pages, and so I decided that I was at the right URL and the results were wrong. Must be a bug.

At the time, the feedback link on the StackOverflow site led to http://stackoverflow.uservoice.com/. And sure enough, if you search for "unanswered tagged", you get about 11 essentially duplicate results, one of which is claimed to be "completed", and nine of which are "declined" with comments such as "make sure you are on the right tab", "you are on the wrong tab, look again" and "I am on the unanswered questions tab. It is still broken."

Sometime within the last two weeks they must have switched their feedback forum to themselves, at http://meta.stackoverflow.com/. And sure enough, the same complaints have reappeared: http://meta.stackoverflow.com/search?q=unanswered+tagged.

What's so puzzling about this to me is that on a site that had put so much thought into usability, exactly the one feature that I wanted to use doesn't work, in spite of obviously having been designed in. And I suppose if you can show unanswered questions and tagged questions, you should be able to show unanswered tagged questions without too much trouble. Which just shows that many people must have many different workflows, and there will be many solutions to satisfy them all. See you later, StackOverflow.

(picture by nicolasnova CC-BY)

Monday, July 6, 2009

Solid-State Drive Benchmarks and the Write Cache


Like many people, we are interested in deploying solid-state drives (SSDs) for our database systems. Jignesh posted some performance test results a while ago, but as I had commented there, this test ran with the write cache on, which concerned me.

The Write Cache

Interlude: The disk write cache is the feature that causes you to lose your data when the server machine crashes or loses power. Just like the kernel may lie to the user-land application about writing stuff to disk unless the user-land application calls fsync(), the disk may lie to the kernel about writing stuff to the metal unless the write cache is turned off. (There is, as far as I know, no easy way to explicitly flush the cache. So write cache off is kind of like open_sync, if you know what that means.) As PostgreSQL pundits know, PostgreSQL does fsyncs at the right places unless you explicitly turn this off, and ignore all the warning signs on the way there. By contrast, however, the write cache is on by default on consumer grade ATA disks, including SATA disks and, as it turns out, also including "enterprise" SSD SATA devices.

To query the state of the write cache on a Linux system, use something like hdparm -W /dev/sda. To turn it off, use hdparm -W0 /dev/sda, to turn it back on, hdparm -W1 /dev/sda. If this command fails, you probably have a higher-grade RAID controller that does its own cache management (and doesn't tell you about it), or you might not have a disk at all. ;-) Note to self: None of this appears to be described in the PostgreSQL documentation.

It has been mentioned to me, however, that SSDs require the write cache for write wear leveling, and turning it off may significantly reduce the life time of the device. I haven't seen anything authoritative on this, but it sounds unattractive. Anyone know?

The Tests

Anyway, we have now gotten our hands on an SSD ourselves and gave this a try. It's an Intel X25-E from the local electronics shop, because the standard, big-name vendor can't deliver it. The X25-E appears to be the most common "enterprise" SSD today.

I started with the sequential read and write tests that Greg Smith has described. (Presumably, an SSD is much better at random access than at sequential access, so this is a good worst-case baseline.) And then some bonnie++ numbers for random seeks, which is where the SSDs should excel. So to the numbers ...

Desktop machine with a single hard disk with LVM and LUKS over it:
  • Write 16 GB file, write caching on: 46.3 MB/s
  • Write 16 GB file, write caching off: 27.5 MB/s
  • Read 16 GB file: 59.8 MB/s (same with write cache on and off)
Hard disk that they put into the server that we put the SSD in:
  • Write 16 GB file, write caching on: 49.3 MB/s
  • Write 16 GB file, write caching off: 14.8 MB/s
  • Read 16 GB file: 54.8 MB/s (same with write cache on and off)
  • Random seeks: 210.2/s
This is pretty standard stuff. (Yes, the file size is at least twice the RAM size.)
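
For reference, the sequential part of this test boils down to something like the following (a sketch; block size and file name are assumptions, the point being that the file is bigger than RAM):
dd if=/dev/zero of=/mnt/test/bigfile bs=8k count=2000000 && sync    # write ~16 GB
dd if=/mnt/test/bigfile of=/dev/null bs=8k                          # read it back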

SSD Intel X25-E:
  • Write 16 GB file, write caching on: 220 MB/s
  • Write 16 GB file, write caching off: 114 MB/s
  • Read 16 GB file: 260 MB/s (same with write cache on and off)
  • Random seeks: 441.4/s
So I take it that sequential speed isn't a problem for SSDs. I also repeated this test with the disk half full to see if the performance would then suffer because of the write wear leveling, but I didn't see any difference in these numbers.

A 10-disk RAID 10 of the kind that we currently use:
  • Write 64 GB: 274 MB/s
  • Read 64 GB: 498 MB/s
  • Random seeks: 765.1/s
(This device didn't expose the write cache configuration, as explained above.)

So a good disk array still beats a single SSD. In a few weeks, we are expecting an SSD RAID setup (yes, RAID from big-name vendor, SSDs from shop down the street), and I plan to revisit this test then.

Check the approximate prices of these configurations:
  • plain-old hard disk: < 100 €
  • X25-E 64 GB: 816.90 € retail, 2-5 weeks delivery
  • RAID 10: 5-10k €
For production database use, you probably want at least four X25-E's in a RAID 10, to have some space and reliability. At that point you are approaching the price of the big disk array, but probably surpass it in performance (to be tested later, see above). Depending on whether you more desperately need space or speed, SSDs can be cost-reasonable.

There are of course other factors to consider when comparing storage solutions, including space and energy consumption, ease of management, availability of the hardware, and reliability of the devices. It looks like it's still a tie there overall.

Next up are some pgbench tests. Thanks Greg for all the performance testing instructions.

(picture by XaYaNa CC-BY)

Friday, July 3, 2009

Do you have a good idea for PostgreSQL?


Do you have a feature request or a good idea for how to improve PostgreSQL? Subscribing to a mailing list is too cumbersome? Or you did subscribe to a mailing list and presented your idea there, and now it's rotting in the archives or on the todo list? Let's try something different.

I have set up a feedback forum over at UserVoice. There you can enter your ideas, comment and vote on other suggestions, and see what the top requests are, without the overhead of tracking a mailing list every day. Let's consider this an experiment. It is not going to replace the existing project communication channels, and you shouldn't send bug reports or engage in coding discussions there. But the more people raise their "user voice" there and provide useful suggestions and comments, the more useful it might become.

To try it, go to: http://postgresql.uservoice.com/

The floodgates are open for development on PostgreSQL 8.5, so now is the time to make yourself heard.

(Disclaimer: UserVoice is a commercial company. I am not associated with them. This is just an attempt to find better interfaces for user feedback.)

(picture by NoNo^Q8 CC-BY)

Thursday, July 2, 2009

Where have all the translations gone?


If you have downloaded PostgreSQL 8.4.0 and are wondering where so many of the translations have gone: The translation team has decided not to ship translations anymore that are not translated at least about 80%. (See the commit message for the list of victims.) This is so that incidental users of stale translations are not presented with a confusing and distracting mix of translated and untranslated messages all the time. So right now we are only shipping a full or almost full set of translations into German, Spanish, French, Japanese, Portuguese, and Turkish.

To get the translations into other languages back into the release, go to http://babel.postgresql.org/ and start submitting updates. Updates may be included as early as release 8.4.1 in a few months.

I hope in particular that we might get the Chinese, Italian, and Russian translations back into shape.

By the way, if you want to start (or continue) translating, I suggest that you approximately follow this priority order: libpq, psql, pgscripts, pg_dump, initdb, postgres. This or a similar order will make the translations useful to the most users with the least amount of work.

Wednesday, July 1, 2009

10+5 Things (Not Yet) Rendered Obsolete By My Mobile Phone

I don't have an iPhone, but I figured when reading about the 15 Apps Rendered Obsolete By The New iPhone 3GS that while my mobile phones have over time obsoleted several gadgets, tools, and other non-phone items, there are still a lot of things they don't do for me, but conceivably should. At the moment, I have a Nokia E71, which is a pretty good phone. But here are a few things it could do better:
  1. Calendar. I still carry a separate small pocket calendar for my personal and business appointments. Mostly because I haven't bothered to figure out how to synchronize the phone calendar with the calendar at the office. And then, I don't think it's easily possible to sync my work appointments with the calendar system at work but back up my personal appointments to MyNokia Backup (or is it Ovi now), without sending the somewhat confidential work appointments to Nokia and the somewhat confidential private appointments to the office. I think you can exclude certain events from being synced at all, but not sync certain events only to certain servers. And as a minor point, the phone calendar doesn't know the local holidays. The calendar issue is obviously a core competency of the mobile phone, so it might just be a matter of time before it's solved. Best guess: End of the year, when I need to decide whether to buy another calendar book.
  2. Notes. The phone has a Notes application (and an Active Notes application, whatever). But there's the synchronization issue again; how do I get the notes from the phone into my action and project lists (which I don't keep on the phone)? I do use the phone to take notes on the train and during meetings, say, but then I transfer them manually to a more trusted container.
  3. GPS receiver. I keep a separate GPS receiver from Garmin for geocaching. The E71 has a GPS receiver and a maps application, and I do use it for navigation in the car, and I have recently discovered the SportsTracker application. But the simple interface of entering coordinates and having an arrow point there that the Garmin provides does not appear to be available on the phone. I do know that some people go geocaching with only their phones, so this must be possible, but I haven't had time to figure it out yet. But this gadget might become the next one to go.
  4. Flashlight. This one ties in with a recent geocaching experience. Many people have tried to use their mobile phone as a light source in a bind and noticed that it doesn't work so well. Perhaps they could put in a few LEDs and a battery that is two millimeters thicker. That would help a lot.
  5. Laser pointer. If they can do flashlight, they can surely do laser pointer. Useful for presentations, when you don't want to carry around an extra gadget.
  6. Bluetooth mouse. Also for presentations, how about just using the phone as a next-slide clicky thing. This is likely already possible with some custom software on both the phone and the laptop, but it could be easier.
  7. Camera. I'm not an expert photographer by any means, and I haven't done any great analysis of the picture quality of the built-in camera, but it just feels better to take a separate digital camera when you actually go out with the purpose of taking pictures.
  8. USB drive. Yeah, you can do this with the cable and then taking care to copy things into the right directory on the phone's memory card or something. Could be easier and clearer, though. And separate from the phone's internal file system. Putting a full-size USB plug into the phone is probably going to be difficult, but for larger models like the Nokia E90 it could be possible.
  9. Debian. Nothing like being able to hack up your own operating system and applications as you choose. It's sad that the Openmoko/FreeRunner initiative has had setbacks. And while Android is an open platform, there is something to be said for having the same operating system on all devices.
  10. Desktop computer. Yeah right. Most of the "productivity" applications on the phone still suck and are at best stopgaps until you get home to the big screen. More on that in a later post.
OK, those weren't 15 things, as with the iPhone. For completeness, how about 5 things that the phone does appear to have replaced permanently:
  1. Address book. I used to have one, but all the current addresses of relevance are in the phone, backed up to somewhere on the Internet. In fact, I could probably find most people I know either via public mailing list archives or something like Facebook anyway.
  2. Land line phone. The stationary phones are gone for good. There is now only this one phone, which is the home phone, the work phone (also no more desk phones at the office), and the VoIP phone. I found this review of Google Voice an interesting contrast in this context. I only have one phone number anyway. (The VoIP line has a different number, but there you want the separate number to make sure you are using the cost-saving path to call.)
  3. Car satnav system. For the few times I have needed it, the satellite navigation system in the phone has done fine. It's not quite the same as having one built into the car, in terms of ease of use, GPS reception, and integration with the other audio equipment, but it works. And you can use the same system to also navigate on walks, as opposed to drives, which is a surprisingly useful feature.
  4. MP3 player. I have never really listened much to MP3s anyway, and now on the phone I have podcasts, video podcasts, internet and broadcast radio, as well as MP3s, which is much better altogether.
  5. iPhone. :-) Don't have one, don't want one. I recommend the Hacking the iPhone session from 25C3 (video page) for further enlightenment.

Thursday, June 25, 2009

World Heritage Title Gone, Culture Remains

So, now it has finally happened. SPIEGEL ONLINE reports that UNESCO has stripped Dresden of its World Heritage title because of the Waldschlösschenbrücke. Maybe it's better this way, because now things will hopefully calm down. The endless discussions had become unbearable anyway.

You can consider the bridge good or bad. What I cannot follow, however, is this: the Dresden Elbe Valley was on the list of World Cultural Heritage sites, not World Natural Heritage sites, which also exists. Culture, that is, in the sense of things created by humans, not just music and painting and the like. What UNESCO had singled out as worth protecting was therefore not the pretty green Elbe meadows, but the Elbe Valley from Pillnitz to Übigau, together with its parks, buildings, bridges, shipyards, and ships, according to the rationale for the listing. That rationale notes that achievements from the 16th to the 20th century are assembled in the Dresden Elbe Valley. But why does culture have to stop in the 21st century? Humanity keeps developing, after all, and the marks this development leaves can and should also flow into an exceptional cultural landscape. Various bridges have already been built in the Dresden Elbe Valley, some of which, like the "Blaues Wunder", are sights in their own right and are even explicitly mentioned by UNESCO. The new Waldschlösschenbrücke may not be the most beautiful bridge in the world, but it is an expression of the culture of the present.

The other aspect that turned this story into a farce is the lamenting by high-level politicians and national celebrities. The fact is: the citizens of Dresden voted for the bridge in a referendum. And even after UNESCO had begun to seriously threaten to revoke the title, clear majorities in several polls said they would still vote for a bridge again. Why that is so is beside the point. Perhaps a quick route to work or to the airport matters more to the people of Dresden. Or perhaps they have recognized that the culturally valuable Blaues Wunder is really a traffic bottleneck. But the citizens have decided, and that has to be accepted. The citizens can, after all, also vote later to tear the bridge down again.

Monday, June 22, 2009

Praise to LXDE

I've been known to rant at times about crappy Linux desktop software. So here's some praise for a change: praise for LXDE.

You know how, when the name of a piece of technology starts with "Simple" or "Lightweight", it's usually anything but? SMTP, SNMP, LDAP? Well, LXDE proves that rule wrong.

I first heard about LXDE from Klaus Knopper's presentation at Chemnitzer Linux-Tage 2009, where he described why he chose LXDE as the new default desktop for his well-known Knoppix live CD/DVD/environment.

I have never really been interested in the debate of late about how short we can possibly get the boot time of a Linux machine/desktop, since I've been using suspend-to-disk for years everywhere. I only reboot for kernel upgrades, and possibly when the wakeup gets botched, which happens about once a month, I guess. But that's a different matter. What I was interested in was a lightweight desktop environment to run inside a VirtualBox.

So LXDE. apt-get install lxde gives you the whole thing. Well, the only thing that's missing is a web browser, which you will have to select and install yourself. I had reported previously on the amusing quest to find a "lightweight" web browser.

Unlike some geek-enabled minimal desktop environments, LXDE doesn't surprise the average user with an unusual layout. You have a taskbar at the bottom (can be moved to the top, for those used to GNOME), with a menu button, buttons for file manager, terminal, browser, minimize all windows; on the right, there is a CPU meter, a clock, a screen lock and a logout button. At least on Debian, the terminal and browser buttons call x-terminal-emulator and x-www-browser, respectively, so whatever browser you choose to install, it will work. There is also an image viewer and a handful of minor tools and settings available through the menu. And you can start random commands with Alt+F2. But that's more or less it.

Another thing that is interesting about LXDE is that it makes "ps" useful again. Under KDE, a "ps x" on a relatively idle desktop produces what feels like 50 processes. Under LXDE it shows 15, which includes the terminal, bash, and ps processes used to produce the listing. And it looks like with a bit of effort that number could be reduced even further.

Even though it's quite small, LXDE supports freedesktop standards. It uses desktop files, dbus, openbox as window manager, and supports compiz if you want. You can easily run GNOME or KDE applications, and they will behave reasonably. Once you do, however, they will of course load their lot of libraries and daemons, and the lightweightness will be gone. So on a workstation desktop, where you might want to run a graphical mail program, a calendar application, network manager, update notifier, and so on, LXDE will probably not buy you much. But on netbooks and virtual boxes, where all you need is a browser and a shell, this is a great alternative that is usable by everyone, and one that is true to its lightweight attribute. So far.

Friday, June 19, 2009

SlideShare and Creative Commons

Today I uploaded all the presentations that I have given at tech conferences over the years to SlideShare. (At least those that were in a reasonable format. I was a user of DocBook Slides in the early days, which produces HTML, which you can't upload to SlideShare.) That was quite a trip through history for me; 6 years of traveling, meeting people, pondering ideas.

I have also gone through and licensed all presentations and this blog under a Creative Commons license. For various reasons I chose different variants of Creative Commons for different bits, but in the future I plan to mostly use the CC-BY variant, which is much like the BSD license in spirit.

Saturday, May 23, 2009

PGCon 2009: Fourth Day - The File System Strikes Back

Leading topics among the talks at PGCon this year were evidently database version control, testing, and deployment methodology, as I mentioned yesterday, as well as "queueing". You need to realize, though, that there are actually two separate uses of the word queueing floating around: one is the message-broker-related usage, the other relates to queueing theory for predicting database performance. Anyway, if you are queueing, you are doing something right this year. In general, I am amazed every year by how many participants appear to have the same set of issues, and then a completely different, but again shared, set of new issues the next year. So at least we are either all completely off track or mostly on the same right track.

In the lightning talks I announced the availability of postgresqlfs, a small project of mine that I theorized about many years ago and which I finally managed to implement within basically two days. If you liked PL/sh, this is the new deal for you. In fact, postgresqlfs allows you to execute your PL/sh functions directly from the shell, which is what you probably should have done in the first place. ;-)

Friday, May 22, 2009

PGCon 2009: Third Day - A New Hope

So this is going to be the new world order. Check your database into a version control system. No, not your .sql files, your database! Check it out, do changes, check it in. Wait, before you check it in, run your test suite. No, not your application test suite, your database test suite! Kind of how you develop your other code, right? Right? Together, Post Facto and pgTAP, and the spirit they represent, might be the most sensible things since the invention of the file system. And there appear to be one or two or three more talks on the program about comparing and consolidating and de-messing your databases, so that appears to be a theme this year. It's about time we organize this. Thanks guys.

Thursday, May 21, 2009

PGCon 2009: Second Day

The developer meeting turned out to be very useful, I thought. We decided to divide the PostgreSQL community into two groups for the next release cycle: one group works on hot standby, one group on synchronous replication. Everyone, please pick a camp and help out. These features are arguably the top adoption issues for PostgreSQL now, and we don't have enough people working on them.

My body clock is still out of whack. I wake up at 5 in the morning. Seems to be a common problem among Europeans, I gather. A good time to hack. Slony-I versions 1.2.16 and 2.0.2 are now uploaded to Debian. There you go.

Wednesday, May 20, 2009

PGCon 2009: First Day

So we're back in Ottawa once more. The trip has been getting smoother over the years. Back in the residence tower overlooking the city. Met some friends and colleagues at the Royal Oak. Noted that they sell Coors under Canadian beers. Baseball on TV. Mmh ... baseball ...

Before everyone asks: I'm not running the Ottawa Marathon or anything else this year. I just ran a half-marathon in Helsinki last week, and that is enough for this month.

The developer meeting is next. I am ready for six and a half hours of Git and anti-Git bashing. ;-)

Monday, May 18, 2009

Regression test code coverage reports

I have been collecting monthly PostgreSQL regression test code coverage reports at <http://developer.postgresql.org/~petere/coverage/>. So if you are wondering what this thing is but haven't had the courage to try it out yourself, there is your chance. (Hmm, buildfarm integration could be nice, someday.)

We have had a line coverage rate of about 66% steadily for a few months now (well, it's feature freeze). The lcov tool labels that as "green" (=good). The new version of lcov, which I have in use as of the April report, also reports function coverage, where we have about 73%, which lcov labels as "red" (=bad).
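
If you want to reproduce those summary numbers outside of the HTML report, they can be computed straight from the lcov tracefile (the .info file that lcov writes), where LF/LH count lines found/hit and FNF/FNH count functions found/hit per source file. A minimal Python sketch, assuming such a tracefile exists; lcov's own output is of course authoritative.

# Sum up overall line and function coverage from an lcov tracefile (sketch only).
import sys

totals = {"LF": 0, "LH": 0, "FNF": 0, "FNH": 0}
with open(sys.argv[1]) as f:                # e.g. coverage.info from "lcov -c"
    for line in f:
        key, _, value = line.strip().partition(":")
        if key in totals:
            totals[key] += int(value)

def rate(hit, found):
    return 100.0 * hit / found if found else 0.0

print("line coverage:     %.1f%%" % rate(totals["LH"], totals["LF"]))
print("function coverage: %.1f%%" % rate(totals["FNH"], totals["FNF"]))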

For the next release cycle, I have two goals in this area: First, expand the test coverage reporting to the entire source tree, not only the backend. And second, improve the test coverage of various neglected areas. There is reduced coverage, for example, in the areas of non-btree indexes, vacuuming, recovery, GEQO; and once we analyze other parts of the source tree, we will probably find gaping holes there.

Wednesday, May 13, 2009

The Big Shots

As the occasional thinker about open-source development practices, communities, and issues, I have been wondering for a while: What are the largest open-source projects? What projects have the most code, the most users, and the most issues to deal with, and how do they cope?

The Debian archive should provide some insights into the first one or two questions, as it contains a very large portion of all available and relevant open-source software and exposes them in a fairly standard form. In the old days one might have gotten out grep-dctrl to create some puzzling statistics, but nowadays this information is actually available in an SQL database: the Ultimate Debian Database (UDD). (And it's in PostgreSQL. And it comes with a postgresql_autodoc-generated schema documentation. Excellent.)

So here is a first question. Well, the zeroth question would have been, which source packages have the largest unpacked orig tarball, but that information doesn't seem to be available, either via UDD or via apt. So the first question anyway is, which source packages produce the largest installation size across all their binary packages:
udd=> SELECT source, sum(installed_size)/1024 AS mib
      FROM packages
      WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
        AND architecture IN ('all', 'i386') AND section <> 'debug'
      GROUP BY source, version
      ORDER BY mib DESC LIMIT 30;
      source       | mib
-------------------+------
 openoffice.org    | 1797
 kde-l10n          |  648
 gcj-4.4           |  544
 vtk               |  465
 linux-2.6         |  404
 openclipart       |  353
 vegastrike-data   |  311
 ghc6              |  308
 gclcvs            |  303
 wesnoth           |  300
 fpc               |  269
 axiom             |  256
 webkit            |  255
 gcc-snapshot      |  255
 lazarus           |  241
 kdebase-workspace |  226
 plt-scheme        |  221
 torcs-data-tracks |  219
 scilab            |  213
 openscenegraph    |  211
 eclipse           |  210
 sagemath          |  201
 insighttoolkit    |  198
 acl2              |  195
 kdebindings       |  181
 atlas             |  165
 gcl               |  163
 trilinos          |  153
 paraview          |  153
 asterisk          |  144
(30 rows)
This produces a few well-known packages, but also a number of obscure ones. If you look closer, many of them appear to be themed around scientific, numerical, visualization, Scheme, Lisp, that sort of thing. Hmm.

Here is another idea. Take a package's installation footprint and multiply it by its popularity contest installation count. So you get some kind of maintenance effort score: it is high either because the package is large or because it has a lot of users, or both.
SELECT rank() OVER (ORDER BY score DESC), source,
       sum(installed_size::numeric * insts) AS score
FROM packages JOIN popcon USING (package)
WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
  AND architecture IN ('all', 'i386')
GROUP BY source, version
ORDER BY score DESC LIMIT 30;
 rank |           source            |    score
------+-----------------------------+-------------
    1 | openoffice.org              | 12638492332
    2 | mysql-dfsg-5.0              |  3411344560
    3 | eglibc                      |  3371485240
    4 | perl                        |  3019183024
    5 | evolution                   |  2669948000
    6 | samba                       |  2308923872
    7 | mesa                        |  1853902860
    8 | texlive-base                |  1684245516
    9 | gcj-4.3                     |  1610495484
   10 | foomatic-db-engine          |  1608178104
   11 | foomatic-db                 |  1423947704
   12 | inkscape                    |  1413910080
   13 | qt4-x11                     |  1258220636
   14 | gcc-4.3                     |  1248741312
   15 | kdelibs                     |  1021058256
   16 | gnome-applets               |   998434136
   17 | xulrunner                   |   958232688
   18 | coreutils                   |   954766896
   19 | openssl                     |   877067672
   20 | ncurses                     |   827679424
   21 | python2.5                   |   815826384
   22 | aptitude                    |   808161380
   23 | gimp                        |   786015124
   24 | gnome-utils                 |   781756328
   25 | nautilus                    |   774319690
   26 | openoffice.org-dictionaries |   761075576
   27 | eclipse                     |   756072380
   28 | dpkg                        |   736626200
   29 | openclipart                 |   731244240
   30 | wine                        |   707967500
(30 rows)
(Yeah, they run this thing on PostgreSQL 8.4 beta 1.)

I noticed that linux-2.6 is absent because of a suspiciously low popcon score (?!?).

I don't want to dump the entire database into this blog post, but if you try this yourself you can look at about the first 200 to 300 places to find reasonably large and complex projects before it gets a bit more obscure. A few highlights:
   51 | gnupg                       |   455660464
   59 | php5                        |   386417572
   60 | mutt                        |   381148176
   83 | icu                         |   258602756
   84 | xorg-server                 |   255186332
  101 | exim4                       |   224857700
  107 | openssh                     |   215792828
  113 | tar                         |   201520400
  114 | postgresql-8.3              |   196844584
  115 | libx11                      |   195856564
  116 | ruby1.8                     |   194681656
  272 | emacs22                     |    62047476
This is obviously still biased in a lot of ways, but it does show the major projects.

The UDD is also an interesting use case that shows how you can deploy a PostgreSQL database as a semi-public service with direct access. A great tool, and a great tool to build other great tools on top of.
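
And since it is just a PostgreSQL database, you can of course also query it from code rather than from psql. Here is a small sketch using the psycopg2 driver, running the installed-size query from above; note that the connection parameters are guesses for illustration only, so check the UDD documentation for the actual public access details.

# Query the Ultimate Debian Database from Python (sketch; connection details are guesses).
import psycopg2

conn = psycopg2.connect(host="udd.debian.org", dbname="udd", user="guest")
cur = conn.cursor()
cur.execute("""
    SELECT source, sum(installed_size)/1024 AS mib
    FROM packages
    WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
      AND architecture IN ('all', 'i386') AND section <> 'debug'
    GROUP BY source, version
    ORDER BY mib DESC
    LIMIT 10
""")
for source, mib in cur.fetchall():
    print("%-25s %5s MiB" % (source, mib))
conn.close()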

European Elections

Last week I received my postal voting documents for the European elections. As a postal voter, you have a bit of time to read through the ballot more carefully. That way you occasionally run into old acquaintances. For example, in sixth place on the CDU's list I see

Sabine Verheyen, housewife, Aachen

That, I strongly suspect, is the same Ms. Verheyen who a few years ago, when I still lived in Aachen, lost the mayoral election against

Jürgen Linden, mayor, Aachen

A tip from an ordinary voter: with all due respect for housewives, please put something else on the ballot. It looks better. If need be, do it like number 2 on the DIE LINKE list: "employee". See, it can be done.

The ballot here lists a total of 31 competing groups. Freedom of association is a great thing, but you could still coordinate with each other once in a while. Perhaps

10. Volksabstimmung, and
26. Für Volksentscheide

could join forces next time?

And perhaps

18. 50plus
22. Die Grauen
30. Rentnerinnen und Rentner Partei
31. Rentner-Partei-Deutschland

can find some common ground next time as well?

The real highlight, though, is the listed occupations of the candidates of the Piratenpartei Deutschlands: business mathematician, physicist, self-employed IT entrepreneur, computer scientist, managing director, student, student, programmer, web developer, consultant. Figures. ;-)

By the way, this time the Wahl-O-Mat is back. Last time it was apparently banned by a court because it did not include all parties. This time almost all 31 are in it. Great ...

Tuesday, April 28, 2009

Choosing wisely

The upshot of last week's database news is: You have to make prudent choices when selecting the software you use. No one knows what MySQL, Sybase, Informix, Greenplum, Kickfire, or EnterpriseDB, to name a few, are going to do next week. If you have bet your business on selling a proprietary software application that links against MySQL, you may be in trouble soon. Just as MySQL itself has been in trouble, or at least under serious threat of it, for some years now by selling a proprietary version of its own software that links against someone else's proprietary software (InnoDB). Or, to leave the database realm for one sentence, by building its development process on a proprietary version-control system.

The question isn't so much whether Oracle will kill MySQL or MySQL will kill itself. Database deployments are very slow to go away, and MySQL's adoption is massive. The question is what will happen to the investment you have made when you chose MySQL. Switching to a different SQL DBMS is extremely painful. Drizzle may now have a BSD-licensed client library, but it isn't going to be a replacement for MySQL, unless you targeted your application for MySQL 3.23. (And Drizzle development, for what it's worth, is hosted on a proprietary hosting platform.) The other forks of MySQL aren't going to help you because they can't sell you a license and they are not even allowed to provide you with updated MySQL documentation.

Of course I will argue that PostgreSQL is somehow positioned more appropriately here, with the BSD license and the way the community works. But this episode shows that open-source software isn't just about free downloads or a fun community experience; it is about protecting your investments, your IT strategy, and ultimately your business.

Tuesday, April 21, 2009

Oracle, Sun, MySQL

So Oracle is to buy Sun now. Well, that makes sense. I guess I jumped ship just in time then. People started congratulating me yesterday that I avoided being an Oracle employee. :-) Although I was de facto a MySQL employee for a while; not sure which is worse.

John Dvorak, possibly one of the most tracked-back men these days, apparently knew all along that the acquisition of MySQL by Sun was just a proxy deal for Oracle. Now, with IBM having nearly snatched up Sun, I guess Oracle had to react.

The question inevitably arose: what will this mean for PostgreSQL? Probably not much. For one thing, you'd be overestimating the impact that Sun has on anything these days. ;-) The acquisition of MySQL by Sun eventually led to the (very small) PostgreSQL business at Sun being phased out, and it is possible that whatever is left will now fade away even more. But it really hasn't impacted the PostgreSQL community much, and it won't do so now.

On the contrary, open-source databases, both MySQL and PostgreSQL, have always (at least since they became credible) been considered the anti-Oracle proposition. Now that MySQL is actually owned by Oracle, PostgreSQL becomes the primary alternative. Considering the sometimes irrational but not insignificant ill will that the Sun acquisition spread among open-source database enthusiasts, it is easy to imagine how much more this deal will hurt MySQL's reputation, on perception alone.

One might of course also think that Oracle now has a devilish plan to corner the database market from both sides. But it would be too presumptuous to assume that Oracle spends $7 billion on an anti-PostgreSQL plan. In some way, this basically only continues the path that Oracle started on when it bought Sleepycat and InnoDB a few years ago, and the impact of that on the open-source database field was pretty minimal.

I don't expect that MySQL will be "killed" either. It is too valuable as a "foot in the door" in any case. Moreover, much of the MySQL momentum already lies outside of Sun anyway, in the hands of Percona, Open Query, the Monty Program, Drizzle, and others, so killing MySQL is already impossible for a single company. Which is probably a good situation for the extended open-source database community.

What about the rest of Sun? I assume Oracle will be quite interested in Solaris and Java, of course, also Open Storage, Glassfish, and some of the other pieces of middleware that no one outside Sun really knows (yay, Sun marketing). Some of these pieces nicely fill the gaps in Oracle's offerings, reduce the interdependencies of Oracle with IBM and/or solidify the already existing relationships with Sun. The cloud computing initiatives might also be of interest, as Oracle has on occasion tried to become an Internet-enabled company no matter how silly the approach (remember the Network Computer or Oracle 8i/9i?). And as many have commented, keeping OpenOffice.org around just to annoy Microsoft is probably going to be worthwhile even if it doesn't make any money. I won't be surprised, however, if the hardware business, meaning SPARC on the one side, and StorageTek on the other, will be sold to someone else.

What I will be interested in seeing is whether Oracle can do a better job exploiting the vertical integration story that Sun has been pushing for basically two decades but has repeatedly failed to deliver on. (Like MySQL not building on SPARC (at some point), and JavaFX not being available for OpenSolaris.)

I do feel sorry for the people at MySQL, who will have to readjust to a new employer and probably a new strategy and a few more sets of reorgs for the second time in less than two years. And all this a day before the MySQL Conference, which will now probably be completely distracted by this news. Add the conspicuous absence of any mention of MySQL in the official announcements, and the uncertainty couldn't be greater.