lightweight database migrations using PostgreSQL

2015-12-30 · Computing

Update 2016-01-26: lightweight database migrations using PostgreSQL 2 details an update to the ideas outlined in this post, fixing various issues and streamlining the overall approach.


How can I run database migrations in a simple manner, with minimal dependencies, and using the database itself rather than a separate utility? To explore this, I use PostgreSQL. Not wanting to use anything particularly exclusive, I opt for PostgreSQL functions using SQL rather than PL/pgSQL. The latter would likely reduce duplication, but I don’t mind a little duplication so long as it’s centralised and gives me something in return. I’m also quite happy with using common utilities like cat and psql.

creating migrations

stage 0: initialisation

We need a way of storing metadata for migrations. We could use a dedicated table, but I opt for a more generic _meta table in case I want to store other metadata directly in the database in a similar format.

-- database/0_init.sql

CREATE TABLE IF NOT EXISTS _meta (
    key     VARCHAR     PRIMARY KEY,
    value   VARCHAR
);

INSERT INTO _meta (key, value) VALUES ('version', 0);

We could run this separately once, or just include it along with the migrations, if we’re not worried about the INSERT failure on subsequent runs. (If you’re running with psql -1, this would be a problem, but I’m specifying transactions explicitly.)

We create some temporary functions, which our migrations will be able to use:

-- database/0_init.sql

CREATE FUNCTION pg_temp.version() RETURNS integer AS $$
    SELECT value::integer FROM _meta WHERE key = 'version';
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.version(integer) RETURNS boolean AS $$
    UPDATE _meta SET value = $1 WHERE key = 'version';
    SELECT TRUE;
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.migrate(integer) RETURNS boolean AS $$
    SELECT CASE WHEN pg_temp.version() = $1 - 1 THEN NULL ELSE FALSE END;
$$ LANGUAGE sql;

The pg_temp.version() function gets the current migration version. We need to cast the type, because of the generic _meta table. The pg_temp.version(integer) function sets a migration version. It returns boolean to be compatible with how we’ll run the migrations, and will indicate that the migration actually executed. The pg_temp.migrate(integer) function returns whether a particular migration should be run. It should only be run if it is strictly the next migration in the sequence. This deals with excluding migrations which have previously been run, and whilst protecting against running migrations in the wrong order accidentally.

stage 1: migrations

We create an example migration:

-- database/1_migrate_1.sql

CREATE FUNCTION pg_temp.migrate_1() RETURNS boolean AS $$
    INSERT INTO _meta (key, value) VALUES ('tmp-1', 1);
    
    SELECT pg_temp.version(1);
$$ LANGUAGE sql;

This is the format that each of our migrations should be in: in its own file if you prefer; numbered consecutively starting from 1; wrapped in a temporary function named with the migration number suffix; returning boolean, with the final statement a SELECT calling the migration version setter. The INSERT INTO statement is merely an example of migration work, here.

stage 2: execution

We need a way of calling the migration functions:

-- database/2_exec.sql

BEGIN;
    SELECT
        pg_temp.version() "version_f",
        COALESCE(pg_temp.migrate(1), pg_temp.migrate_1()) "migrate_1",
        pg_temp.version() "version_t"
    ;
COMMIT;

We wrap this in a transaction, and chain the calls in the correct order. With PostgreSQL, even DDL statements will get rolled back if the transaction fails. The calls to the pg_temp.version() function report the starting and ending migration versions. The COALESCE() handles only running a migration if it should be run next and hasn’t been run already, using lazy-evaluation of parameters.

We make a small script to ensure we don’t forget which order to call the files in, returning the entire SQL code to be executed:

# database.sql.sh

#!/bin/sh

cat \
        database/0_init.sql \
        database/1_migrate_*.sql \
        database/2_exec.sql

This uses cat to concatenate the files in correct order of stages: 0, 1, 2. It doesn’t matter which order the individual migrations in stage 1 are concatenated, as they only define functions; it’s only in stage 2 they’re conditionally executed.

running migrations

Finally, we run the migrations against the database, using psql -x to transpose the output to list one line per migration:

database.sql.sh | psql -x
CREATE TABLE
INSERT 0 1
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
BEGIN
-[ RECORD 1 ]
version_f | 0
migrate_1 | t
version_t | 1

COMMIT

The CREATE TABLE and INSERT lines indicate that the _meta table was initialised. The CREATE FUNCTION lines indicate that our migrations have been loaded. The RECORD 1 shows that the migration version started at 0, ran migrate_1, and ended at 1.

If we run the migrations again:

database.sql.sh | psql -x
CREATE TABLE
ERROR:  duplicate key value violates unique constraint "_meta_pkey"
DETAIL:  Key (key)=(version) already exists.
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
BEGIN
-[ RECORD 1 ]
version_f | 1
migrate_1 | f
version_t | 1

COMMIT

The _meta table initialisation gives an error, but we can ignore that. The RECORD 1 shows that the migration version started at 1, didn’t run migrate_1, and ended at 1.

concluding thoughts

I like the simplicity of this approach, with SQL files able to be used without being wrapped in another language, and requiring only cat and psql as dependencies. I don’t like the repetition in database/2_exec.sql, although it does have its advantages in being able to define a custom name for each migration to display in the final report. Alternatively, a more complex function using PL/pgSQL could likely be used, wrapping the calls in a loop. The use of side-affects when calling using SELECT is rather hacky; perhaps I’d prefer it split into multiple statements. That would also make me more comfortable about evaluation-order guarantees of parameters; COALESCE provides the necessary assurances, but I’m not sure about SELECT. We could wrap a single transaction around the outside of everything using psql -1, but I’d be worried that I’d forget to type that one day and run migrations outside of a transaction by mistake.

The code in this post is also available in my coding notebook.