Thursday, December 2, 2010

Development time

Let's say you want to contribute to PostgreSQL development and want to play by the rules (which are actually not rules, but guidelines or encouragements), such as:
  1. While a release is in beta, you work on finalizing the release, not on future projects.
  2. During a commitfest, you work on testing and integrating the proposals submitted for the commitfest, not on new features.
  3. Major features should not be submitted for the first time at the last commitfest.
The final release of PostgreSQL 9.0.0 was on 2010-09-20. By that time you already missed the first commitfest (2010-07), and the second commitfest (2010-09) was already under way. The second commitfest was actually slightly delayed and ended on 2010-10-26, whereas the third and next-to-last-for-9.1 commitfest (2010-11) started punctually on 2010-11-15.

That means if, while being a team player on all the community efforts, you wanted to develop a major new feature for PostgreSQL 9.1, you had a total of about 20 days to do it. (That is, if you didn't spend several days in early November at PgWest.) And that is within a one-year release cycle.

Thursday, November 4, 2010

pipefail

It is widely considered good style to include
set -e
near the beginning of a shell script so that it aborts when there is an uncaught error. The Debian policy also recommends this.

Unfortunately, this doesn't work in pipelines. So if you have something like
some_command | sed '...'
a failure of some_command won't be recognized.

By default, the return status of a pipeline is the return status of the last command. So that would be the sed command above, which is usually not the failure candidate you're worried about. Also, set -e is defined to exit immediately if a pipeline returns a nonzero status, so the exit status of some_command by itself isn't considered there.

Fortunately, there is a straightforward solution, which might not be very well known. Use
set -o pipefail
With pipefail, the return status of a pipeline is "the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully". So if some_command fails, the whole pipeline fails, and set -e kicks in. Note that you need to use set -o pipefail and set -e together to get this effect.
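
A minimal illustration (some_command stands in for whatever real command you are running):
#!/bin/bash
set -e
set -o pipefail

# Without pipefail, a failure of some_command would be masked by the
# successful sed, and the script would continue. With pipefail, the
# pipeline's status is nonzero, so set -e aborts the script right here.
some_command | sed 's/foo/bar/'

echo "only reached if the whole pipeline succeeded"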

This only works in bash, so if you're trying to write scripts that conform to POSIX or some other standard, you can't use it. (There are usually other ways to discover failures in pipelines in other shells, but none are as simple as this one, it appears.) But if you are writing bash anyway, you should definitely use it. And if you're not using bash but use a lot of pipelines in your scripts, you should perhaps consider using bash.

(Hmm, it looks like there could be a number of latent bugs in the existing Debian package maintainer scripts, because this issue appears to be widely ignored.)

Monday, November 1, 2010

Git User's Survey 2010 Results

The results of the Git User's Survey 2010 are up.

Not many surprises, but I can see how this sort of survey is very useful for the developers of Git.

Tuesday, October 5, 2010

Git User's Survey 2010

The Git User's Survey 2010 is up. Please devote a few minutes of your time to fill out the simple questionnaire; it'll help the Git community understand your needs, what you like about Git (and what you don't), and help improve it.

The survey is open from 1 September to 15 October, 2010.

Go to https://git.wiki.kernel.org/index.php/GitSurvey2010 for more information.

Wednesday, July 28, 2010

PostgreSQL and Flymake

Flymake is an on-the-fly syntax checker for Emacs. Here is how you can use it for editing the PostgreSQL source code. Add this to your src/Makefile.custom:
check-syntax:
        $(COMPILE.c) -fsyntax-only $(CHK_SOURCES)
Then you can activate the flymake minor mode either manually using M-x flymake-mode, or by adding it to your Emacs customization for PostgreSQL. I use:
(defun pgsql-c-mode ()
  "C mode adjusted for PostgreSQL project"
  (interactive)
  (c-mode)
  (flymake-mode)

  ; .. and the rest ...
)
Here is a screenshot:
Notice the marked line with the syntax error and the tooltip/popup with the error message, which appears when the mouse hovers over the marked line.

Note, however, that since activating flymake will essentially cause a compiler to be running continuously in the background, this is not what you want to use when hacking PostgreSQL on the road. ;-)

Saturday, July 3, 2010

Increasing the priority of Debian experimental

Many people run Debian testing or unstable or some mix thereof.  This works pretty well for a development system or a general desktop system if you know a bit about what you're doing (note: nonetheless officially not recommended).  Sometimes you throw packages from experimental into the mix, if you want to get the latest stuff that isn't yet fully integrated into the rest of Debian.

The default APT priority of the Debian experimental release is 1, which ensures that it is never automatically installed or upgraded. This is not always ideal, in my experience. Of course, you don't want a package from experimental to take precedence over a package from stable, testing, or unstable by default. But I think when you have in the past installed a package from experimental, you probably want to pull in upgrades to that package from experimental as well. Otherwise, you will end up with completely unmaintained packages on your system.  That is because in practice many packages in experimental are not actually experimental or broken or unmaintained, but just an advance branch of some software that for some reason isn't ready to go down the unstable-testing-stable road.

To make this work better, I have set the priority of experimental to 101 in my /etc/apt/preferences:
Package: *
Pin: release experimental
Pin-Priority: 101
Now the following will happen: If you just apt-get install a package, it will come from whatever "normal" release you have in your sources.list, say stable or testing. You can override that using -t experimental as usual. If you install a package from experimental and later an upgrade is available in experimental, apt-get upgrade will install that automatically. Also, if an upgrade in a "normal" release appears that has a higher version number, that version will be installed.
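
To check how these priorities play out, apt-cache policy is handy (the package name here is only an example):
# Show the versions APT can see for a package and the pin priority of each:
apt-cache policy some-package

# Or show the priorities assigned to all configured releases:
apt-cache policy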

Of course, caveats apply. Some software in experimental is really experimental and should only be installed under close supervision. If a package is available only in experimental, this setup will install it when you ask for the package, even if you might not have actually wanted it if you had known that it was in experimental. Figure it out yourself. :)

Similar considerations apply to backports. I use
Package: *
Pin: release a=lenny-backports
Pin-Priority: 102
On the system I have in mind here, the standard distribution is stable, testing is at priority 101, and backports is at 102, taking precedence over testing. The reason is that for some architecture-independent packages you don't need backports at all, and this way you can pull them directly from testing.

In general, the APT priority business is relatively powerful and often a good alternative to, say, manually downloading packages from various distributions, installing them manually, forgetting where they came from, and never upgrading them.

Wednesday, June 30, 2010

An Update On Views

A long time ago, in a text book far away, someone described the following (relational) database design approach: Put all your data in tables, normalized and so on, as you'd expect.  On top of each table, create a view.  The database clients only access the views.  As the application evolves, tables might change and more views will be added, but the interface that the existing views present to the outside stays the same.  Sounds nice; everything stays backward compatible.
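
A minimal sketch of the pattern (the table and view names are made up, and the database name is only an example):
psql -X -d mydb <<'EOF'
-- Clients only ever touch the view; the table behind it can later be
-- renamed, split, or otherwise reorganized without changing the interface.
CREATE TABLE customers_data (
    id integer PRIMARY KEY,
    name text NOT NULL
);

CREATE VIEW customers AS
    SELECT id, name FROM customers_data;
EOF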

In order for this to work, views need to be updatable.  Updatable views have been standardized in SQL as far back as 1992, and have been implemented (partially, possibly) in some of the popular/successful SQL implementations.  Curiously, however, this feature has evaded PostgreSQL altogether so far.  Conventional wisdom has held that views can be made updatable by hand using the rewrite rules system, and that a proper implementation of automatically updatable views would only need to create the rules automatically.  Debate and coding around this idea have been taking place for many years, mostly inspired by the heroic efforts of Bernd Helmle.  But I think the conclusion of those in the know by now is that the rewrite rules system is inappropriate for this task (and in fact others have argued that the rewrite rules system is inappropriate for any task and should be ripped out; perhaps harsh but not far from the truth in my opinion).  So the updatable views effort is at square one (or zero) at the moment.

In the meantime, updatable views have been conspicuously absent from the recent lists of top-needed features for PostgreSQL.  Which basically means no one really cares (anymore).  Why is that?

My observation is that the typical use of views has changed quite a bit over the years (decades), and has decreased altogether.  The idea of using views as transparent interfaces on top of tables never really took off to begin with.  Using views to encapsulate complex queries is still reasonable, but not terribly common.  Part of the reason is that many queries are generated, either by ORMs or by custom code.  The other reason is that server-side functions provide a much more powerful abstraction mechanism.  They can't do everything, in particular inlining of queries doesn't always work, but they can do a lot more with parameters and general procedural programming.  Views are also used for access control, either for columns or for rows.  Column security can now be had with explicit column privileges in PostgreSQL, and one could argue that using views for this was a workaround all along.  And using views for row access control doesn't actually really work, as it turns out.  Proper row security might be an upcoming feature, but it will need to work differently.  And anyway, superfinely granular access control is rather rare in SQL databases overall, I guess.  And in all of these cases, the views don't need to be updatable.

Views are occasionally used as a thin wrapper around a table-returning function, so the function can be accessed like a table.  But this only works when reading.  Another use, somewhat related actually, is defining views for monitoring the database system or the database application.  Because those views are often used interactively, it is important that they are convenient to access, and so hugely complex queries combined with custom functions or possibly query generation are inappropriate.

On the other hand, materialized views are all the rage now, appearing at the top of the requested PostgreSQL feature list, and having been fixtures in the feature lists of other SQL implementations for some time.  While traditional views are a tool for organizing code and interfaces, materialized views are purely a performance optimization tool (which, by the way, is the reason materialized views are not specified in the SQL standard).  Depending on how you implement materialized views, they might even be usable implicitly, if the query matches the materialized query.  Then, materialized views wouldn't really be views at all anymore, they're just query caches.

So, what do you think?  Are traditional views an obsolescent, marginal feature?  Is the matter of updatable views obsolete?

Friday, May 28, 2010

System-Versioned Tables

After my report on the upcoming SQL:2011, some people had asked me about the system-versioned table feature that is going to be arguably the only major new feature there. Here is how it works:
CREATE TABLE tab (
    useful_data int,
    more_data varchar,
    start timestamp GENERATED ALWAYS AS SYSTEM VERSION START,
    end timestamp GENERATED ALWAYS AS SYSTEM VERSION END
) WITH SYSTEM VERSIONING;
(This hilariously verbose syntax arises because this is defined so that it fits into the more general generated columns feature, e. g., GENERATED ALWAYS AS IDENTITY, similar to PostgreSQL's serial type.)
INSERT INTO tab (useful_data, more_data) VALUES (...);
This sets the "start" column to the current transaction timestamp, and the "end" column to the highest possible timestamp value.
UPDATE tab SET useful_data = something WHERE more_data = whatever;
For each row that would normally be updated, set the "end" timestamp to the current transaction timestamp, and insert a new row with the "start" timestamp set to the current transaction timestamp. DELETE works analogously.
SELECT * FROM tab;
This only shows rows where current_timestamp is between "start" and "end". To show the non-current data, the following options are possible:
SELECT * FROM tab AS OF SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS BEFORE SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS AFTER SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS BETWEEN SYSTEM TIME timestamp AND timestamp;
There's also the option of
CREATE TABLE tab ( ... ) WITH SYSTEM VERSIONING KEEP VERSIONS FOR interval;
to automatically delete old versions.

That's more or less it. It's pretty much xmin/xmax/vacuum on a higher level with timestamps instead of numbers. And it's a revival of the old time travel feature. Obviously, you can do most or all of this with triggers already.
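
For illustration, here is a rough sketch of emulating just the UPDATE behavior with a trigger. The column names start_time and end_time are my own (start and end are awkward identifiers in practice), and DELETE handling as well as filtering out non-current rows are left out:
psql -X -d mydb <<'EOF'
CREATE TABLE tab (
    useful_data int,
    more_data varchar,
    start_time timestamp NOT NULL DEFAULT transaction_timestamp(),
    end_time timestamp NOT NULL DEFAULT 'infinity'
);

CREATE FUNCTION tab_versioning() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    -- Keep the old row version around, closed off at the current transaction time.
    INSERT INTO tab (useful_data, more_data, start_time, end_time)
        VALUES (OLD.useful_data, OLD.more_data, OLD.start_time, transaction_timestamp());
    -- The updated row becomes the new current version.
    NEW.start_time := transaction_timestamp();
    NEW.end_time := 'infinity';
    RETURN NEW;
END;
$$;

CREATE TRIGGER tab_versioning BEFORE UPDATE ON tab
    FOR EACH ROW EXECUTE PROCEDURE tab_versioning();
EOF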

Wednesday, May 19, 2010

PostgreSQL Package Management

Database development sucks, and I would like to fix it. I don't mean database system software development; that's awesome. ;-) I mean database development as in writing the tables, views, functions, and other code that make up your database. We have come further in recent years in PostgreSQL land. We have had the PL/pgSQL debugger, there is Piggly for PL/pgSQL code coverage, we have pgTAP for unit testing support, we have in Post Facto a version control system running inside the database, although I'll try to explain below why I think that that is not the right solution. My problem is that getting database code from the editor to the database server in a safe manner is pretty difficult. This already starts with deploying simple database code to a single server for the first time (as shown in this entry), and gradually gets more complicated when you need to update existing installations, manage multiple servers, or even multiple versions of that code.

My answer to that problem is an old friend: package management. Package managers such as dpkg and rpm are pretty well-established solutions and have shown over the years that managing software deployments can be easy and occasionally even fun.

Consider how the development and deployment of normal software proceeds:
  • You edit some files in your text editor.
  • You execute some kind of build step (could be compilation or more trivial).
  • Maybe you run a test suite.
  • You check your files into version control.
And then you or someone else, perhaps a release manager or system administrator, does something like this:
  • Tag a release.
  • Export the tag from version control.
  • Build the software.
  • Run tests.
  • Build a package.
  • Push package to repository.
  • Install/upgrade package on deployment machines.
There are of course sites that do this differently. One could, for example, just copy the built files onto the target machines. But if you do that, then you might as well be happy with the current state of SQL deployment. Many sites are going for continuous deployment nowadays, in which case you might not need a way to track what version of the code is currently running, but you will still want a way to make deployments and upgrades in a safe manner. My premise is that the development cycle of database software should work in pretty much the same way as for other software. But we are only slowly getting the tools to support this work flow. We have editors that allow editing SQL, of course. We don't really need any new build tools; the existing ones appear to serve just fine. We have testing support, getting better all the time. At this point it was once thought that perhaps a database-based version control system such as Post Facto would address the remaining issues. After considering that for a while I have figured out that it doesn't. The main problems with that approach are:
  • Abuses version control system as a deployment tool. Some people deploy code by svn checkout, but most don't.
  • Does not allow for a build step (in a straightforward manner). The database is the source code. You can't generate your database from some XML source, for example.
  • I consider it preferable to put database code and related non-database code in the same version control system, so they can be developed, browsed, deployed, etc. together.
A database-based version control system might actually be the thing for some people, just like other version control systems serve different needs. (Except, note that mine is better than yours! ;-) )

So I think what the world needs is a package manager for databases, that manages database objects just like a package manager for operating systems manages files. And it should work in much the same ways, like this:

  • Conceptually, there are two major modes of operation: building a package and installing a package (think rpmbuild and rpm, or dpkg-buildpackage and dpkg).
  • Building a package is controlled by a special instruction file (think RPM spec file or debian/ directory).
  • Building a package involves running that software's normal build routine and installing it into a temporary staging area. In simple cases, this is running psql -f somefile.sql, but it could be anything. You could have an ORM tool generate the database.
  • The package file is an archive file containing the contents of that staging area plus some metadata.
  • Installing a package unpacks the archive file and places the contents into the designated destination, while sensibly dealing with existing installations.
There are additional features of well-known package managers that could be of interest in a database environment: pre/post scripts for additional setup, relocations (perhaps for selecting installation schemas), integration with a repository manager (like yum or apt).

So I have set out to write such a tool and have arrived at a small prototype. Here is how it works:

$ pgpkg build
 pgpkg/rules clean
make: Nothing to be done for `clean'.
 pgpkg/rules build
make: Nothing to be done for `build'.
 pgpkg/rules binary
psql -X -q -f install_foo.sql
pgpkg create
wrote ../foo_1.0.pgpkg
This build process is currently driven by a Debian-style rules and control file. An RPM-style spec file might also be possible.
$ pgpkg contents ../foo_1.0.pgpkg
{'Client-Encoding': 'UTF8', 'Version': 1.0, 'Format': '0.0', 'Package': 'foo'}
CREATE FUNCTION "public"."test1"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 1$$;
CREATE FUNCTION "public"."test2"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 2$$;
CREATE FUNCTION "public"."test3"(integer, integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 3$$;

$ pgpkg install -d 'dbname=bar' ../foo_1.0.pgpkg

$ pgpkg remove -d 'dbname=bar' foo
There are more features planned, such as diffing packages, but I haven't gotten to those yet. Also, don't get too excited yet, it only handles a very small subset of SQL objects. Much more code needs to be added to handle everything, but that will mostly be a typing exercise (famous last words).

Ah, but what about this extension packaging business that has been discussed for the last couple of years? Well, I have been staring at the various design documents for that same couple of years now, and it's not going anywhere and it's not really set out to solve my problem. Partially by coincidence, partially by intent, as I'm posting this the guys are sitting in Ottawa discussing extension packaging yet again, so at least take this as a challenge to get something going. :-)

Note that I am calling this thing a "package", because that's what it is. It's not (necessarily) an "extension" (although an extension could be packaged as a package), and it's not a "module" (that term is reserved by SQL; what Oracle calls a package is actually a module).

Also note that this is implemented as an external program, not in the database server. I think the operating system shell is a much more usable and versatile environment. Think of all the things you might want to do: build many packages, compare packages, install some/many/all packages, grep package contents, copy, download, or upload packages. This makes most sense from the shell. If you think about it, neither dpkg nor rpm require extensions in the kernel or the file system to do their work; the same applies here. Another advantage is that the implementation can evolve faster and doesn't need to wait for say PostgreSQL 9.1 to be out.

Anyway, for the moment, forget about extensions. Think about the database application code that you need to move around. Does the above approach make sense? Could it be useful? Ideas, suggestions?

Tuesday, May 18, 2010

Visual Explain Reloaded

One of the new features in PostgreSQL 9.0 is EXPLAIN output in different markup formats (XML, JSON, YAML). The idea is that this makes it easier for programs (as opposed to humans) to parse the output.

Using one of the standard teaching examples, compare the outputs:
regression=> EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=227.95..700.46 rows=101 width=488)
   Hash Cond: (t2.unique2 = t1.unique2)
   ->  Seq Scan on tenk2 t2  (cost=0.00..434.00 rows=10000 width=244)
   ->  Hash  (cost=226.68..226.68 rows=101 width=244)
         ->  Bitmap Heap Scan on tenk1 t1  (cost=5.03..226.68 rows=101 width=244)
               Recheck Cond: (unique1 < 100)
               ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.01 rows=101 width=0)
                     Index Cond: (unique1 < 100)
(8 rows)
regression=> EXPLAIN (FORMAT JSON) SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;

                     QUERY PLAN
-----------------------------------------------------
 [                                                  +
   {                                                +
     "Plan": {                                      +
       "Node Type": "Hash Join",                    +
       "Join Type": "Inner",                        +
       "Startup Cost": 227.95,                      +
       "Total Cost": 700.46,                        +
       "Plan Rows": 101,                            +
       "Plan Width": 488,                           +
       "Hash Cond": "(t2.unique2 = t1.unique2)",    +
       "Plans": [                                   +
         {                                          +
           "Node Type": "Seq Scan",                 +
           "Parent Relationship": "Outer",          +
           "Relation Name": "tenk2",                +
           "Alias": "t2",                           +
           "Startup Cost": 0.00,                    +
           "Total Cost": 434.00,                    +
           "Plan Rows": 10000,                      +
           "Plan Width": 244                        +
         },                                         +
         {                                          +
           "Node Type": "Hash",                     +
           "Parent Relationship": "Inner",          +
           "Startup Cost": 226.68,                  +
           "Total Cost": 226.68,                    +
           "Plan Rows": 101,                        +
           "Plan Width": 244,                       +
           "Plans": [                               +
             {                                      +
               "Node Type": "Bitmap Heap Scan",     +
               "Parent Relationship": "Outer",      +
               "Relation Name": "tenk1",            +
               "Alias": "t1",                       +
               "Startup Cost": 5.03,                +
               "Total Cost": 226.68,                +
               "Plan Rows": 101,                    +
               "Plan Width": 244,                   +
               "Recheck Cond": "(unique1 < 100)",   +
               "Plans": [                           +
                 {                                  +
                   "Node Type": "Bitmap Index Scan",+
                   "Parent Relationship": "Outer",  +
                   "Index Name": "tenk1_unique1",   +
                   "Startup Cost": 0.00,            +
                   "Total Cost": 5.01,              +
                   "Plan Rows": 101,                +
                   "Plan Width": 0,                 +
                   "Index Cond": "(unique1 < 100)"  +
                 }                                  +
               ]                                    +
             }                                      +
           ]                                        +
         }                                          +
       ]                                            +
     }                                              +
   }                                                +
 ]
(1 row)
For just reading the plan, the new formats are probably not better, although some might prefer them. The real win comes when you can feed this to a program to create a visualization. With the old format, parsing the output was complicated and error prone. With the new formats, it is easy. In fact, it was so easy that I couldn't resist writing a small visual explain program that renders plans through the graphviz library. Here is how you can use it:
regression=> \a\t
regression=> EXPLAIN (FORMAT JSON) SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2 \g |veung
Result:

Simple and useful. Get it from GitHub.
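
And if you would rather poke at the JSON from a script without going through graphviz, parsing it is nearly a one-liner. Here is a rough sketch of my own using Python (the regression database is the one from the example above; it prints the top plan node, "Hash Join" here):
psql -XAt -d regression \
     -c 'EXPLAIN (FORMAT JSON) SELECT * FROM tenk1 t1, tenk2 t2
         WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2' \
  | python -c 'import json, sys; print json.load(sys.stdin)[0]["Plan"]["Node Type"]'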

Tuesday, May 11, 2010

MERGE Syntax

The SQL MERGE statement has gotten my attention again. For many years, PostgreSQL users have been longing for a way to do an "upsert" operation, meaning do an UPDATE, and if no record was found do an INSERT (or the other way around). MySQL users in particular are familiar with the REPLACE statement and the INSERT ... ON DUPLICATE KEY UPDATE statement, two variant attempts to solve that problem (which have interesting issues of their own). Of course, you can achieve this in PostgreSQL with some programming, but the solutions tend to be specific to the situation, and they tend to be lengthier than one would want.

Discussions on this then usually proceed to speculate that the SQL-standard MERGE statement ought to be the proper way to solve this, and then it turns out that no one completely understands the MERGE syntax or semantics, especially as they apply to this upsert problem. (I was in that group.) And that's been the end of that so far. OK, as I write this I am pointed, via Robert Haas's blog post, to an older mailing list post by Simon Riggs, surely one of the people most qualified to drive an eventual implementation, that contains a hint toward the solution; but it's hard to find in that post, if you want to try.

This subject had gotten my attention again at the SQL standard working group meeting I attended a few weeks ago, where I learned that in SQL:2011, a DELETE branch has been added to MERGE. We also took some time after the official part of the meeting to work through some examples that illustrate the uses of the MERGE statement.

Let's take a look at what the MERGE statement is originally supposed to do, and where the term "merge" arose from. Let's say you have a table with outstanding balances, such as
CREATE TABLE balances (
    name text,
    balance numeric
);
and at intervals you get a list of payments that your organization has received, such as
CREATE TABLE payments (
    name text,
    payment numeric
);
What you want to do then is to "merge" the payments table into the balances table in the following way:
  • If a balance exists, subtract from it.
  • If the balance goes to zero, delete the entire record.
  • If no balance exists, create a record (maybe someone pre-paid).
The command to do this would be:
MERGE INTO balances AS b
    USING payments AS p
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -p.payment);
Of course there are simpler cases, but this shows all of the interesting features of this command.

How does this get us upsert? There, you don't have two tables, but only one and some values. I have seen some claims and examples about this in the wild that turn out to be wrong because they evidently violate the syntax rules of the SQL standard. So I did the only sensible thing and implemented the MERGE syntax into the PostgreSQL parser on the flight back, because that seemed to be the best way to verify the syntax. So the correct way, I believe, to do, say, an upsert of the balances table would be:

MERGE INTO balances AS b
    USING (VALUES ('foo', 10.00), ('bar', 20.00)) AS p (name, payment)
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -p.payment);
Not all that nice and compact, but that's how it works.

Note that the AS clause after VALUES is required. If you leave it off, the PostgreSQL parser complains that a subquery in FROM needs an AS clause. Which is obviously not what this is, but it uses the same grammar rules, and it makes sense in this case because you need a correlation name to join against. It was also one of those rare moments when something you implemented gives correct feedback that you hadn't even explicitly provided for.

Anyway, the examples above all parse correctly, but they don't do anything yet. But if someone wants to implement this further or just try out the syntax, I'll send my code.

Friday, May 7, 2010

Update: PostgreSQL doesn't really need a new Python driver

A couple of months ago, there was a brief stir in the PostgreSQL community about the state of the Python drivers and there being too many of them. While we can't really do much anymore about there being too many, the most popular driver, Psycopg, has since had a revival:
  • A new release 2.0.14 has been issued.
  • The license has been changed to GNU LGPL 3, alleviating various concerns about the previous "hacked up" license.
Of course, this doesn't yet fix every single bug that has been mentioned, but it ought to be the foundation for a viable future. The mailing list has been quite active lately, and the Git repository has seen a bunch of commits, by multiple developers. So if you have had issues with Psycopg, get involved now. And thanks to all who have made this "revival" happen.

Wednesday, April 28, 2010

Duplicate Stuff

When you attempt to create a table that already exists, you get an error. When you attempt to create an index that already exists, you get an error, unless you happened to pick a different name. Then you have two indexes that do the same thing. In PostgreSQL 9.0, index names can get generated automatically, so this might get worse. The same applies to constraints: Nothing stops you from creating the logically same constraint twice. Unless you explicitly name constraints, then you get an error. I sometimes give explicit names to check constraints just so later on I have a clue why I needed to check that x is greater than 3. But naming primary key constraints or foreign key constraints is usually a bit silly, because it's clear what they do. So it's easy to end up with 5 identical foreign key constraints.

Is it a common problem that duplicate constraints or indexes get created? I have seen this happen in different ways. You run scripts that you think are idempotent, but they are not really in this sense. (It's in the details: CREATE TABLE foo (a REFERENCES b ...) will fail cleanly when run twice. CREATE TABLE foo (a); ALTER TABLE foo ADD FOREIGN KEY ... will not. See pg_dump.)  Or logically duplicate things are created in independent ways, because say the psql \d output isn't clear enough or isn't checked. Obviously, all of these are human errors in some way.
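
In the meantime, you can at least hunt for existing duplicates by hand. Here is a rough sketch of a catalog query that lists indexes covering the same key columns on the same table (it does not distinguish expression or partial indexes, and the database name is only an example):
psql -X -d mydb <<'EOF'
SELECT indrelid::regclass AS table_name,
       array_agg(indexrelid::regclass) AS indexes
FROM pg_index
GROUP BY indrelid, indkey::text
HAVING count(*) > 1;
EOF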

I'm thinking that a setting to notify the user about this could be useful. It could be a warning or even an error. I can't see any possible reason in the normal course of use that someone would want two foreign keys that point exactly the same way, or two check constraints that check exactly the same thing, or two indexes that are set up in exactly the same way.

Comments? Experiences?

Monday, April 19, 2010

News from the SQL Standard

Last week, I attended the meeting of DIN NA 043-01-32 AA, which is the German "mirror group" of ISO/IEC JTC1 SC32, whereof WG3 produces ISO/IEC 9075, which is titled "Database languages - SQL".  Once you dig through all these numbers and letters and find out who is responsible for what, it's actually quite simple to get involved there and review or contribute things.

For the benefit of everyone who is interested but hasn't had the chance to get involved in that process, here is what is currently going on:
  • A new standard is currently in the "Final Committee Draft" (FCD) phase, which basically means "beta".  The final release is expected in 2011, so you will begin to see mentions of "SQL:2011".
  • The new standard will only contain parts 1/Framework, 2/Foundation, 4/PSM, 11/Schemata, 14/XML. The other parts are currently not being developed, which doesn't mean they are dead or withdrawn, but that no one bothers to add things to them at the moment.
  • All new features in SQL:2011 will be in the form of optional features.  So the level of core conformance is not impacted by the release of a new standard.
There isn't actually that much new in SQL:2011, besides countless fixes and clarifications.  I counted only a dozen or so new features.  Here are some things that might be of interest to the PostgreSQL community:
  • The syntax ALTER TABLE ... ALTER COLUMN ... SET/DROP NOT NULL has been taken into the standard.  PostgreSQL has supported that since version 7.3.  Coincidence?  Not sure.
  • Constraints can optionally be set to NO ENFORCE.  That means the database system won't enforce them but still assumes they are valid, for example for optimization.
  • System-versioned tables: Perhaps the largest new feature, this is a way to make data (rows) visible only during certain times. Some pundits might recall a moral predecessor of this feature labeled SQL/Temporal.  I haven't fully analyzed this feature yet, so I'll post later about the details.
  • Combined data change and retrieval. PostgreSQL does something like this with RETURNING, but this feature is more elaborate and allows the writing of "delta tables".
  • Named arguments in function calls. PostgreSQL 9.0 supports that, but using the syntax foo(3 AS a) instead of what ended up in the standard, foo(a => 3).
  • Default values for function arguments. PostgreSQL 9.0 supports that as well.
This time I attended the meeting as a guest. We are discussing some procedural and financial issues to make this an official membership.  If anyone else is interested in getting involved in the SQL standard development, let me know and I can point you in the right direction.  If we have enough interest, we can set up a discussion group within the PostgreSQL project.

Saturday, March 13, 2010

Running SQL scripts with psql

If you are using PostgreSQL, have you ever loaded an SQL script file or dump file through psql? Well, duh.

-f vs. <

If you are just starting out, you will probably try this:
psql mydb < dump.sql
and that's quite OK.

Once you hit your first error message such as
ERROR:  syntax error at or near "VEIW"
you might figure out that for some reason
psql mydb -f dump.sql
is better, because it produces
psql:dump.sql:56: ERROR:  syntax error at or near "VEIW"
instead, allowing you to actually find the error in your file.

Now I admit that it is almost entirely me who is to blame for this bizarre difference, because at some point in the distant past, the GNU Coding Standards recommended that programs should behave the same independent of whether the standard input or output is a terminal or a file. The current version of said standard actually explicitly creates an exception saying that error messages should be changed to the noninteractive style when the standard input is not from a terminal. So this should probably be fixed.

Note that the -f form above is not portable. It depends on the GNU getopt extension that permits options after nonoption arguments. To be portable, you need to write either
psql -d mydb -f dump.sql
or
psql -f dump.sql mydb
Frankly, I hardly ever do this because I rarely use a non-GNU system, but keep it in mind when writing scripts or documentation intended to be portable.

psqlrc

The next thing you should always do when running psql scripts is to use the option -X, which prevents the reading of the .psqlrc file. That file could contain anything, and you have no idea how it will interact with your script.

In my mind, this is a design mistake in psql. Unix shells have different startup files for interactive and noninteractive usage, so they don't have this problem.

Quiet or Loud

Anyway, if you are restoring a dump, these commands will produce output that looks something like this:
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
These are the responses from the DDL commands that are in your SQL script file. I personally like to not see these, because they don't tell me anything useful and they cause the important information such as error messages to fly rapidly off the screen.

Some people might like these responses because they serve as a kind of progress indicator. Which is true, but there is a flaw. Creating tables and functions is pretty fast. Chances are that that part of the dump file will fly way off your screen in two seconds. The slow part of a database restore is the COPY commands that restore the data. And those do not produce any response at all! So as a progress report, this output is not all that useful. OK, the ALTER TABLE parts at the end that create the foreign keys can be slow, and of course the CREATE INDEX commands, but in the above example, the longest wait would be after the last CREATE FUNCTION.

There are two ways to make this behavior more sane: If you don't want the "progress" output, only the errors and other important messages, use the option -q (or --quiet). This is probably more useful for restoring a dump on the console. If you want a full progress report, use the option -a (or --echo-all), which will in addition to the default behavior print each statement from the file before executing it. This is probably fairly useful in batch jobs where you are logging the output to a file or somewhere. But most likely you will want to use one of these two options for almost any noninteractive invocation of psql. You can also somewhat usefully use both options at the same time; think about it.

Notices

On the matter of controlling the output, you might want to hide the NOTICE messages such as
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test1_pkey" for table "test1"
These get pretty boring quickly, especially when you create a lot of tables with a lot of primary keys.

psql doesn't provide a direct command-line option for turning this off, but there is an obscure method: Set the environment variable PGOPTIONS to '--client-min-messages=warning', which will pass that option to the server process. So all together, your call might look like
PGOPTIONS='--client-min-messages=warning' psql -d mydb -q -f dump.sql

For some tastes, it might actually make sense to change the client_min_messages option permanently in postgresql.conf on the server.

Also note that analogous to the case of .psqlrc described above, PGOPTIONS could contain all kinds of settings that you don't want. So even if you don't want to disable notices as shown here, it would be most prudent to unset the environment variable in other cases.

Transactions

The next thing you might want to consider is using transactions, or rather a single transaction. Otherwise, if the execution of the script or restoration of the backup fails somewhere along the way, you have a half-loaded database and no easy restart point. Sometimes this doesn't matter that much; if you are restoring into a new database, you can just drop it and start over. If you are deploying updates into an existing system, much more care is required. The option to execute a file in a single transaction is -1 or --single-transaction.

Now depending on what exactly is in the file you want to load, using a single transaction may or may not work very well. For dump files created with pg_dump, it usually works, unless you used the -c/--clean option, in which case the DROP commands that appear at the beginning of the file might not have anything to drop and will fail, failing the entire transaction. That problem could be addressed if pg_dump used DROP IF EXISTS, but it doesn't.

When deploying schema changes into an existing database, this can get very tricky. For functions, you can usually use CREATE OR REPLACE FUNCTION to have an idempotent function creation command. Except that it will fail when the return type was changed. And there is no CREATE OR REPLACE AGGREGATE or CREATE OR REPLACE OPERATOR (yet?). With other object classes such as types or casts it can be easier to just unconditionally drop and recreate the objects. If you drop a type, however, everything that uses that type is also dropped, so be sure to recreate everything (such as functions) afterwards in the same transaction. This won't help when the type is used in tables, though; don't blow away your tables. Exactly how to manage this type of situation is a science of its own and would go beyond the scope of this post.

Side note: An alternative for some situations is using the psql variable ON_ERROR_ROLLBACK, which allows you to ignore errors but still make use of the atomicity property of transactions, useful when doing schema upgrades.
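
ON_ERROR_ROLLBACK is a psql variable, so it is set with -v on the command line (or \set inside psql); for example (the file and database names are placeholders):
psql -X -1 -v ON_ERROR_ROLLBACK=on -d mydb -f upgrade.sql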

If you are convinced that running SQL scripts in transactions is the right thing, and you have adjusted your scripts to behave properly in that context, then you may also wish to consider the option
-v ON_ERROR_STOP=1
This causes psql to stop execution when a transaction fails. Otherwise it would continue to execute the rest of the script and issue the error message
ERROR:  current transaction is aborted, commands ignored until end of transaction block
for every command, which makes no sense. There was a discussion a while ago about making ON_ERROR_STOP default to on when the single-transaction option is used, but it was not clear how this should interact with savepoints. If you are making use of savepoints, you may want to stay away from this option or evaluate yourself whether it makes sense for you.

Pager

If you run a psql script with the output on a terminal (which is normal when installing databases, restoring backups, etc.), the script executes a SELECT command, and the output doesn't fit on the screen, the output is run through the pager, which will normally wait for some key to be pressed to continue. This is obviously an extremely useful feature in psql in interactive mode, but it also happens when you run a script, which is dubious.

Often, this won't be a problem, because backup files for instance don't contain SELECT commands with large outputs. But you can simulate this by taking any dump file that restores a sequence, which will contain SELECT setval(...) calls. If you make your terminal window 4 lines or less, you can see the effect of this. Again, this is contrived, but every so often someone puts a SELECT command in a file to create some kind of visual verification that the tables or the functions or the data that the script was supposed to load is now really there. There is nothing wrong with that, except when you run the script from the terminal and forget to check back with extreme frequency. The old joke that someone left a transaction open before going to lunch has never been more real.

Add the option
--pset pager=off
to the psql invocation to disable this behavior.

Summary

OK, got all that? Write this down; here is how you really should execute an SQL script file in psql:

PGOPTIONS='--client-min-messages=warning' psql -X -q -a -1 -v ON_ERROR_STOP=1 --pset pager=off -d mydb -f dump.sql

Should this be easier? I think so. Discuss.

Wednesday, March 10, 2010

Looking for Free Hosting

I'm looking for a way to do free hosting. But I mean free as in freedom, not free as in beer. Let me explain.

When I'm using a piece of free and open-source software such as OpenOffice.org, Evolution, or anything else, I have certain possibilities, freedoms if you will, of interacting with the software beyond just consuming it. I can look at the source code to study how it works. I can rebuild it to have a higher degree of confidence that I'm actually running that code. I can fix a bug or create an enhancement. I can send the patch upstream and wait for the next release, or in important cases I can create a local build. With the emergence of new project hosting sites such as GitHub, it's getting even easier to share one's modifications so others can use them. And so on.

As a lot of software moves to the web, how will this work in the future? There are those that say that it won't, and that it will be a big problem, and that's why you shouldn't use such services. Which is probably what a lot of free-software-conscious users are doing right now. But I think that in the longer run, resisting isn't going to win over the masses to free software.

First of all, of course, the software would need to be written. So a free web office suite, a free web mail suite that matches the capabilities of the leading nonfree provider, and so on. We have good starts with Identi.ca and OpenStreetMap, for example, but we'd need a lot more. Then you throw it on a machine, and people can use it. Now as a user of this service, how do I get the source code? Of course you could offer a tarball for download, and that is the approach that the AGPL license takes. One problem with that is, if you are used to apt-get source or something similar for getting the source, everyone putting a tarball on their web site in a different place isn't going to make you happy. A standardized packaging-type thing ought to be wrapped around that. Another problem is that even if you trust the site's operator that that's the source code that's actually running on your site (even without malice, it could for example be outdated against the deployed version), it probably won't contain the local configuration files and setup scripts that would allow me to duplicate the service. And if I just want to study how the program is running in actuality, there is not much I can do.

Giving everyone SSH access to the box is probably not a good idea, and won't really solve all the issues anyway. In the future, when virtualization is standardized, ubiquitous, and awesome, one might imagine that a packaging of a web service won't be "put these files on the file system and reload a daemon" but instead "put these files and this configuration on that machine and activate it". This might give rise to a new generation of Linux "distributors". Getting the source tarball or "source package" might then involve getting a snapshot of that image, which you can examine, modify, and redeploy elsewhere. That could work for OpenStreetMap, for example, modulo the space and time required for their massive database. (But you might choose to fork only the code, not the data.) But it won't be easy to do the right thing in many cases, because with a web service, there is usually other people's data on the machine as well, which would need to be masked out or something. Maybe this really can't be done correctly, and the future will be more distributed, like in the way Jabber attempted to supplant centralized services such as ICQ. Distributed web mail makes sense, distributed OpenStreetMap perhaps less so.

Ideas anyone? Does anyone perhaps have experiences with running a web service that attempts to give users the freedoms and practical benefits that are usually associated with locally installed software?

Friday, February 5, 2010

My Favorite PostgreSQL 9.0 Feature

My favorite PostgreSQL 9.0 feature does not have a two-letter acronym. It's the new bytea format, available since 8.5alpha1.

At F-Secure Labs, as you might imagine, we store information about a bunch of malware samples. Throughout the computer security industry, file samples, malware or perhaps not, are referred to by a hash value, such as MD5, SHA1, or SHA256. The typical representation of such hash values in most programming environments and also in prose and literature is the hexadecimal format, for example da39a3ee5e6b4b0d3255bfef95601890afd80709. Except when you want to write a test case against PostgreSQL or want to track down a problem, you'll be looking for \3329\243\356^kK\0152U\277\357\225`\030\220\257\330\007\011, also known as \\3329\\243\\356^kK\\0152U\\277\\357\\225`\\030\\220\\257\\330\\007\\011 in some contexts.

Well, that's over now; it will show as \xda39a3ee5e6b4b0d3255bfef95601890afd80709. You will still need to take care of the backslash, but that will surely be resolved when standard_conforming_strings is turned on in version 10.0 ;-), or we implement a new, SQL-standard conforming binary string type without legacy syntax issues.
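
The output format is controlled by the new bytea_output parameter, with 'hex' being the default in 9.0 and 'escape' giving the old behavior, so comparing the two is easy (a quick demonstration; the database name is only an example):
psql -X -d mydb <<'EOF'
SET bytea_output = 'hex';
SELECT 'abc'::bytea;    -- \x616263

SET bytea_output = 'escape';
SELECT 'abc'::bytea;    -- abc
EOF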

By the way, the actual origin of this feature idea was a performance problem, reported by Bernd Helmle. The new format is quite a bit faster to encode and decode: In some internal tests, pg_dump of tables with mostly binary data was twice as fast and created a dump file that was half the size with the new format compared to the old format. So hopefully everyone wins.

Another new feature in the same area, by the way, is that PL/Python now supports bytea values sanely, contributed by Caleb Welton.

Sunday, January 31, 2010

Going ...

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting.

See you there! Or maybe even there.

Wait ... I have the last slot on Saturday and the first slot on Sunday?!? Great! :^)

Monday, January 25, 2010

PostgreSQL: The Universal Database Management System

I'm glad you asked, since I've been pondering this for a while.  $subject is my new project slogan.  Now I'm not sure whether we can actually use it, because a) it's stolen from Debian, and b) another (commercial, proprietary) database product already uses the "universal database" line.

I have come to appreciate that the "universality" of a software proposition can be a killer feature.  For example, Debian GNU/Linux, the "universal operating system", might not be the operating system that is the easiest to approach or use, but once you get to know it, the fact that it works well and the same way on server, desktop, and embedded ensures that you never have to worry about what operating system to use for a particular task.  Or Python, it's perhaps not the most geeky nor the most enterprisy programming language, but you can use it for servers, GUIs, scripting, system administration, like few other languages.  It might as well be the "universal programming language".  A lot of other software is not nearly universal, which means that whenever you move into a new area, you have to learn a lot of things from scratch and cannot easily apply and extend past experiences.  And often the results are then poor and expensive.

The nice thing about PostgreSQL is that you never have to worry about whether to use it, because you can be pretty sure that it will fit the job.  Even if you don't care whether something is "open source" or "most advanced".  But it will fit the job.  The only well-known exception is embedded databases, and frankly I think we should try to address that.

Tuesday, January 12, 2010

Procedural Languages in PostgreSQL 8.5: The One That Works!

While much of the PostgreSQL hacker world is abuzz over two-letter acronyms (HS, SR, VF), I will second Andrew's post and will generalize this to say, partially tooting my own horn, of course, that the next PostgreSQL release will be a great one for procedural languages. Behold:
  • PL/pgSQL is installed by default.
  • New DO statement allows ad hoc execution of PL code (see the example after this list).
  • PL/pgSQL finally got a sane parser.
  • PL/Perl got a shot in the arm.
  • PL/Python got saner data type handling, Unicode support, and Py3k support.
  • Not directly related, but the coming PL/Proxy features are looking promising as well.
  • (Meanwhile, language historians will be interested to know that PL/Tcl has received exactly zero feature or bug-fix commits since 8.4.)
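
For instance, the new DO statement lets you run an ad hoc PL/pgSQL block without creating a function first (a trivial illustration; the database name is only an example):
psql -X -d mydb <<'EOF'
DO $$
BEGIN
    RAISE NOTICE 'ad hoc PL/pgSQL, no CREATE FUNCTION required';
END;
$$;
EOF
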
This will be a great boost for PostgreSQL the development platform.

Monday, January 4, 2010

Remove and Purge

Debian's package manager dpkg has the perhaps unique feature that it distinguishes between removing and purging a package. Removing a package removes the program files but keeps the configuration files (and sometimes the logs) around; purging it really removes everything. While this distinction undoubtedly has some uses, I have found that I almost never make use of it. I think in about six years of using Debian I have actually needed a remove-but-not-purge functionality about five times: during some really tricky upgrades (and using Aptitude instead of APT might have helped, not sure) and once when I wanted to build a package that had a build dependency that conflicted with a package I had installed (cowbuilder came later).

I think many people don't fully realize this distinction, and thus aged systems will often contain dozens or hundreds of removed-but-not-purged packages lying around. Great fun cleaning that up. And therefore, at some point in the distant past I have switched all my APTs to purge by default, using the configuration setting Apt::Get::Purge "true";. At the time I thought this would be daring, but I have never looked back. The one time a year that I don't want to purge I override this by hand.
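
As for cleaning up the leftovers that have accumulated on an existing system, a bit of shell does it. A rough sketch (review the list before purging, and the purge itself needs root):
# Packages in dpkg state "rc" have been removed but their configuration files remain.
dpkg -l | awk '/^rc/ { print $2 }'

# Purge them all:
dpkg -l | awk '/^rc/ { print $2 }' | xargs -r dpkg --purge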

Later, APT actually got an apt-get purge command, but there is no apt-get autopurge and no apt-get dist-upgrade-and-purge (or whatever) to purge the packages it wants to remove. This can be worked around by carefully adding --purge to all invocations of apt-get, but who will remember that. And of course apt-get remove is hardwired into my fingers.

How do other people handle this? Are there undiscovered reasons removing is the better default? How do you clean up packages that were forgotten to be purged?

Sunday, January 3, 2010

Missing Features for PostgreSQL SQL Conformance

A thought to start the new year: Perhaps it's time for the final push to complete the core SQL conformance for PostgreSQL.

Where do we stand? The PostgreSQL documentation lists in its appendix the currently supported and unsupported SQL features. As explained there, a certain subset of these features represents the "Core" features, which every conforming SQL implementation must supply, while the rest is purely optional. The unsupported features page currently lists 14 remaining Core features and subfeatures that are missing from PostgreSQL. Two of those are about client-side module support that is actually not mandatory if the implementation provides an embedded language (e.g., ECPG), so there are 12 items left.

So that's not so bad. Here's a list of the missing features:

E081-09 USAGE privilege

This would mean adding a USAGE privilege to domains.

Maybe this isn't very useful, although perhaps those working on SELinux support might have a more qualified opinion on it.  But let's say if we get all the other things done and this is left, this would be a fairly straightforward and well-defined feature to add.

(This would then complete feature E081 Basic Privileges.)

E153 Updatable queries with subqueries

This presupposes updatable views and requires views to be updatable even if their WHERE clause contains a subquery.

This is probably the big one. In the current PostgreSQL architecture, updatable views are apparently quite difficult to implement correctly. The mailing list archives contain plenty of details.

F311-04 CREATE VIEW: WITH CHECK OPTION

This also presupposes updatable views and requires the CHECK OPTION feature. See above.

(This would then complete feature F311 Schema definition statement.)

F812 Basic flagging

This feature means that there should be some implementation-specific facility that raises a notice or warning when a non-standard-conforming SQL statement or clause is used. Or in other words, a facility that warns when a PostgreSQL extension is used.

A naive implementation might consist of just adding something like elog(WARNING, "not SQL standard") in about five hundred places, but the trick would be to implement it in a way that is easy to maintain in the future. The mailing list archives also contain some discussions about this, key word "SQL flagger".

S011 Distinct data types

This is a way to define user-defined types based on existing types, like
CREATE TYPE new AS old;
Unlike domains, this way the new type does not inherit any of the functions and operators from the old type. This might sound useless at first, but it can actually create better type safety. For example, you could create a type like
CREATE TYPE order_number AS int;
while preventing that someone tries to, say, multiply order numbers.

The implementation effort would probably be similar to that for domains or enums. Also, search the mailing list archives for "distinct types".

(This includes feature S011-01 USER_DEFINED_TYPES view.)

T321 Basic SQL-invoked routines

There are a number of bits missing from fully SQL-compatible SQL function definitions, besides the specific subfeatures mentioned below.
  • Instead of a routine body like AS $$ ... $$, allow one unquoted SQL statement as routine body (see example below under RETURN).
  • LANGUAGE SQL is the default.
  • SPECIFIC xyz clause, allowing the assignment of an explicit "specific routine name" that can be used to refer to the function even when overloaded. Probably not terribly useful for PostgreSQL.
  • DETERMINISTIC / NOT DETERMINISTIC clause. DETERMINISTIC means the same as IMMUTABLE in PostgreSQL; NOT DETERMINISTIC is then STABLE or VOLATILE.
  • CONTAINS SQL / READS SQL DATA / MODIFIES SQL DATA clause. These also appear to overlap with the volatility property in PostgreSQL: MODIFIES would make the function volatile, READS would make it STABLE.
Also, for DROP FUNCTION the ability to drop a function by its "specific name" is required:
DROP SPECIFIC FUNCTION specific_name;
There are probably some more details missing, so part of finishing this item would also be some research.

T321-02 User-defined stored procedures with no overloading

Add a new command CREATE PROCEDURE that does the same thing as CREATE FUNCTION .. RETURNS void, and a DROP PROCEDURE command.

T321-04 CALL statement

Add a new command CALL procname() that does the same thing as SELECT procname() but requires procname() to not return a value, meaning it has to be a procedure in the above sense.

T321-05 RETURN statement

Add a new command RETURN callable only from within SQL functions. Then, instead of writing a function like
CREATE FUNCTION name(args) RETURNS type LANGUAGE SQL
AS $$ SELECT something $$;
write
CREATE FUNCTION name(args) RETURNS type LANGUAGE SQL
RETURN something;

That's it! Plus all the stuff I missed, of course. We only have about 2 weeks left(!) until the final commit fest for the 8.5 release, so it's a bit late to tackle these issues now, but maybe for the release after that?