
lightweight database migrations using PostgreSQL - 2

In lightweight database migrations using PostgreSQL I outlined some ideas about running database migrations in a simple manner, with minimal dependencies, using the database itself in combination with common language-agnostic utilities. In this post, I detail an update to the ideas previously outlined, fixing various issues and streamlining the overall approach.

creating migrations

stage 0: pre

For the stage previously named initialisation (0_init.sql), I create a stage called pre (0_pre.sql). Although almost all of the code in this file remains the same, this is not simply a rename, but rather a slight alteration in how I’m thinking about the code which runs before and after the migrations. Previously, I considered this stage as just a setup phase, but now I consider this more as one end of a wrapper which surrounds the migrations-proper. However, the only code change is to move the transaction beginning and reporting of the current migration version into this file.

-- database/0_pre.sql

CREATE TABLE IF NOT EXISTS _meta (
    key     VARCHAR     PRIMARY KEY,
    value   VARCHAR
);

INSERT INTO _meta (key, value) VALUES ('version', 0);


CREATE FUNCTION pg_temp.version() RETURNS integer AS $$
    SELECT value::integer FROM _meta WHERE key = 'version';
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.version(integer) RETURNS boolean AS $$
    UPDATE _meta SET value = $1 WHERE key = 'version';
    SELECT TRUE;
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.migrate(integer) RETURNS boolean AS $$
    SELECT CASE WHEN pg_temp.version() = $1 - 1 THEN NULL ELSE FALSE END;
$$ LANGUAGE sql;


BEGIN;

SELECT pg_temp.version() "version_f";

stage 1: migrations

For the migrations-proper stage, I previously defined a unique temporary function per migration, leaving execution until the last stage. The idea was that each function would only be called if needed, and in the correct order. But this failed in some scenarios: where defining the function itself depended on something which didn’t yet exist in the database, the function definition was invalid, breaking the migration chain. In such a case, the later execution stage never mattered, as the overall process had already failed.
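
For example (a hypothetical migration, not taken from the original post), a definition like the following would fail at CREATE FUNCTION time if the widgets table were due to be created by an earlier migration which had not yet run, because PostgreSQL validates the body of an SQL-language function when it is defined (with the default check_function_bodies setting):

-- hypothetical example of the old skeleton breaking at definition time:
-- if the widgets table is created by an earlier, not-yet-run migration,
-- this CREATE FUNCTION fails before any migration has executed
CREATE FUNCTION pg_temp.migrate_2() RETURNS boolean AS $$
    INSERT INTO widgets (name) VALUES ('example');

    SELECT pg_temp.version(2);
$$ LANGUAGE sql;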

Instead, I change the skeleton of each migration to conditionally execute the migration within that file itself, requiring a strict order of cross-file concatenation (see running migrations). This also eliminates the chain of SELECT columns which was previously used to guarantee execution order and report on whether each migration was run. Everything related to a migration is now contained in its single file, making it easier to reason about, and reducing edits when creating a new migration using an existing one as a template (a common workflow pattern for me, I find, regardless of migration tooling or language). I also rename the function so that the same temporary function name is used each time, created and dropped within the same file. Again, this reduces the differences between migration files.

I also trivially rename the migration file prefix to be shorter. If desired, a longer migration comment can be supplied as the column name, as before. Note that although this approach still requires the migration number to be specified in multiple places (because of not using variables, for simplicity), at least now those multiple places are in the same file (indeed, in the same group of lines).

-- database/1_m_1.sql

CREATE FUNCTION pg_temp.migration() RETURNS boolean AS $$


-- TODO: insert migration SQL


SELECT pg_temp.version(1);
$$ LANGUAGE sql;
SELECT COALESCE(pg_temp.migrate(1), pg_temp.migration()) "m_1";
DROP FUNCTION pg_temp.migration();

stage 2: post

For the stage previously named execution (2_exec.sql), I create a stage called post (2_post.sql). This is the other end of the wrapper which surrounds the migrations-proper. Rather than conditionally execute the migrations as before, this now simply commits the transaction and reports on the final migration version, as the migrations themselves have already been run. As before, I ensure that migrations end-to-end are wrapped within an explicit transaction, to facilitate rollback in the event of an error. As noted in the original post, with PostgreSQL, even DDL statements will get rolled back if the transaction fails, making the use of a transaction attractive.

-- database/2_post.sql

SELECT pg_temp.version() "version_t";

COMMIT;

running migrations

As noted above (see stage 1: migrations), simply using cat does not provide the ordering guarantees required, despite the split into three separate stages. Instead, I switch to a natural sort of the files using ls -v. Using this method, it shouldn’t be necessary to anticipate the total number of migrations in advance and zero-pad (e.g. 001 rather than 1), although that could be done if desired.

# database.sql.sh

#!/bin/sh

cd database/ && cat $(ls -v)
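
For illustration (a hypothetical listing, assuming migrations numbered past 9): plain alphabetical ordering, as produced by a shell glob expansion, would place 1_m_10.sql before 1_m_2.sql, whereas ls -v sorts the numeric parts naturally:

ls
0_pre.sql  1_m_1.sql  1_m_10.sql  1_m_2.sql  2_post.sql

ls -v
0_pre.sql  1_m_1.sql  1_m_2.sql  1_m_10.sql  2_post.sql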

Peace,
tiredpixel

Code related to this post is available from my coding notebook.

Docker Compose for a Twelve-Factor Ruby Rails app

I’m very much a fan of The Twelve-Factor App methodology. Docker (with Docker Compose) can be used to build on these ideas, allowing greater control over an app’s environment and reinforcement of the principles. Here, I present some notes related to setting up Docker Compose for a (not-quite) Twelve-Factor Ruby Rails app, specifically from commits I’ve made to Tunefl, my open-source LilyPond mini-score engraving and sharing service for musicians. Note that the Docker Compose documentation has an example of containerising a Ruby Rails app. However, that installs gems as root and executes with elevated privileges. Besides, I have something more complex in mind! :)

preliminary thoughts

I tend to be cautious about changing from approaches which work well, so have kept an interested eye on the many touted advantages of using Docker without immediately rushing to convert everything. Much like any tool, including a screwdriver, Docker can be used for a variety of approaches to solve a problem. I’m not convinced that using Docker automatically means a solution is better than one without it or without explicit containerisation. However, there are a few specific things offered by Docker for development which became too tempting for me to resist:

  • version-locking of external service dependencies: Having a specific, repeatable recipe for library dependencies is vital for long-term stability and sanity. I’ve been increasingly trying to get into the habit of explicitly documenting all external service versions in READMEs (e.g. PostgreSQL 9.3). Docker allows these dependencies to be enforced.

  • virtualisation security: If you’re developing within a single organisation, isolated development zones might not be a concern. But if you’re working across multiple security contexts, such as for multiple clients, this becomes something you should consider seriously. If it’s possible to git pull some company’s code and start what you think is a local webserver, only to find a mistake or malicious modification causes another company’s code or keys to be sent over the network, you’ve got a problem. Virtualisation, such as using VirtualBox, protects against this. Docker makes this virtualisation less obtrusive.

  • networking security: Even if you’re running code in isolation, it’s important to remember the security of external services. If you simply access all your PostgreSQL databases under the same user, even setting a password won’t protect you against an app connecting to the wrong database. This could be something quite accidental even within the same app, such as wiping the wrong database when running an automated test suite. That alone is one reason why I prefer to access dev and test databases with different users and passwords, sometimes adding additional protection such as requiring an extra environment variable (e.g. CI=1) to be set for the test stack (see the sketch after this list). Docker not only allows individual service instances to be separated, but also the entire network stack to be segmented, stopping not only one organisation’s project from accessing another’s, but also your test stack from accessing your dev stack.

  • mount restrictions: This is something I’ll confess I don’t usually bother with when developing. But Docker allows easy configuration of volumes, including mounting all of the working directory as read-only and granting specific exceptions. This means I can be sure that an app is not doing something unpleasant like writing a runtime file when I’m not expecting it (breaking when behind load-balancing, not to mention not being part of my server backup strategy).
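
As a small sketch of the CI=1 idea mentioned in the networking security point above (hypothetical, and not part of Tunefl itself): a wrapper script which refuses to run the test suite unless the variable has been set explicitly:

# test.sh (hypothetical)

#!/bin/sh

# refuse to run the test suite unless CI=1 is set explicitly,
# guarding against accidentally running it against a dev database
if [ "$CI" != "1" ]; then
    echo "refusing to run tests: set CI=1 to target the test stack" >&2
    exit 1
fi

exec bundle exec rake test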

docker-compose.yml

Let’s work inwards, starting from an existing Procfile definition of internal services:

# Procfile

web:            bundle exec thin start -p $PORT
worker:         bundle exec sidekiq

Tunefl also has a .env.example, designed to be copied to .env and adjusted (and copied to .test.env, etc. for multi-stack setups). This is the app’s central settings file in environment variable form, containing external service connections such as DATABASE_URL. I don’t use config/database.yml files or similar. We can continue to use these settings with only minimal changes, also keeping it easy for those who wish to install without Docker.

We’ll be using the newer networking method rather than the Docker container links method, so we need to change hostnames; e.g.:

# .env.example

DATABASE_URL=postgresql://tunefl_dev:password@tunefl_postgres_1.tunefl/tunefl_dev
REDIS_URL=redis://tunefl_redis_1.tunefl:6379/0

For the web service, we set the build context as the current directory, point to the templated .env, mount the current directory as read-only, declare port 8080 within the container, accessible outside it on an ephemeral port, and specify the web service previously defined in Procfile (PORT isn’t available here, so we specify the port explicitly; it’s only used within the container, so this doesn’t worry me too much).

# docker-compose.yml

---
web:
  build: &build "."
  env_file: &env_file ".env"
  volumes:
    - ".:/srv/tunefl/:ro"
  ports:
    - "127.0.0.1::8080"
  command: "bundle exec thin start -p 8080"

For the worker service, we use the same build context and .env, using YAML node anchors and references to keep the settings in sync (we could also use Docker Compose’s extends syntax), use a non-default Dockerfile because the worker requires additional packages to be installed, connect to the same volumes as web, and specify the worker service previously defined in Procfile. It’s worth noting that although volumes are shared here, that’s only to facilitate Tunefl’s default local-storage setup (rather than AWS S3 or similar), as well as to make gem installation faster.

# docker-compose.yml (continued)

worker:
  build: *build
  env_file: *env_file
  dockerfile: "Dockerfile.worker"
  volumes_from:
    - web
  command: "bundle exec sidekiq"

For the postgres and redis services, we lock to particular versions, and use the images’ defaults.

# docker-compose.yml (continued)

postgres:
  image: "postgres:9.3.10"
redis:
  image: "redis:3.0.5"

I also used the opportunity to change the app to write logs only to STDOUT rather than also to files in the log/ directory, but this wouldn’t be necessary for an app already following Twelve-Factor properly. Also, because of the read-only mount and some difficulties with db/schema.rb being changed from the repository version on gem installation, I disabled the rake db:schema:dump task, which is automatically called by rake db:migrate (this is an old Rails 3 app, so config.active_record.dump_schema_after_migration isn’t available, and the rake db:schema:load method doesn’t work for this app).

Dockerfile

Next, we need a Dockerfile to define how to build the image. In this case, we’ll actually create two: Dockerfile containing the main definition for the web service, and Dockerfile.worker containing the definition for the worker service, which requires additional packages to be installed. Where possible, we’ll speed up the builds by considering carefully the order of the Docker layers, and using a shared volume for gem installation. For an app with fewer services or less complex package dependencies, one Dockerfile would likely be sufficient.

We use SYNC comments, which have no special meaning, to denote which sections of Dockerfile and Dockerfile.worker should be kept in sync, either for Docker layer optimisation (SYNC: Dockerfile/1), or because it’s a shared pattern (SYNC: Dockerfile/2).

Dockerfile—web service

We base the build on the official Docker ruby:2.2.3 image, install packages needed for gem installation, and create a user:

# Dockerfile

# SYNC: Dockerfile/1 {
FROM ruby:2.2.3
RUN \
    apt-get update -y && \
    apt-get install -y \
        build-essential \
        libpq-dev && \
    useradd --home-dir /srv/tunefl/ --shell /usr/sbin/nologin tunefl
# SYNC: }

Next, we create the skeleton structure within our writeable volumes, and set ownership. It’s probably not necessary to create every one of these directories, but I had a list handy from Capistrano deployments, so I used this. Note that we don’t actually chown /srv/tunefl itself, as the default permissions allow the unprivileged user read access, and that’s all we need.

# Dockerfile (continued)

RUN \
    mkdir \
        /srv/tunefl.bundle/ \
        /srv/tunefl/ \
        /srv/tunefl/public/ \
        /srv/tunefl/public/assets/ \
        /srv/tunefl/public/system/ \
        /srv/tunefl/public/uploads/ \
        /srv/tunefl/tmp/ \
        /srv/tunefl/tmp/cache/ \
        /srv/tunefl/tmp/pids/ \
        /srv/tunefl/tmp/sockets/ && \
    chown -R tunefl \
        /srv/tunefl.bundle/ \
        # not /srv/tunefl/
        /srv/tunefl/public/ \
        /srv/tunefl/tmp/

Next, we copy the library dependency definitions. These aren’t owned by the unprivileged user either, ensuring that these definitions cannot accidentally change when we install the gems.

# Dockerfile (continued)

COPY [ \
    "Gemfile", \
    "Gemfile.lock", \
    "/srv/tunefl/"]

We set the working directory and become the unprivileged user, which will be used both to install the gems and to run the web service itself. We define BUNDLE_APP_CONFIG to point to a writeable volume owned by the unprivileged user. Then, we install the library dependencies.

# Dockerfile (continued)

# SYNC: Dockerfile/2 {
WORKDIR /srv/tunefl/
USER tunefl
ENV BUNDLE_APP_CONFIG /srv/tunefl.bundle/
# SYNC: }

RUN bundle install --path /srv/tunefl.bundle/

Finally, we declare the writeable volumes: /srv/tunefl.bundle to contain installed gems, and /srv/tunefl/public/ and /srv/tunefl/tmp/ to be mounted over the top of read-only /srv/tunefl. Ideally, we’d switch tmp/ to another location entirely, but this doesn’t appear to be straightforward with Rails 3.

# Dockerfile (continued)

VOLUME [ \
    "/srv/tunefl.bundle/", \
    "/srv/tunefl/public/", \
    "/srv/tunefl/tmp/"]

Dockerfile.worker—worker service

We keep the SYNC: Dockerfile/1 section in sync with that in Dockerfile, rather than modifying it to add our additional packages. This enables the Docker layer cache to be used, significantly increasing the speed of builds.

# Dockerfile.worker

# SYNC: Dockerfile/1 {
FROM ruby:2.2.3
RUN \
    apt-get update -y && \
    apt-get install -y \
        build-essential \
        libpq-dev && \
    useradd --home-dir /srv/tunefl/ --shell /usr/sbin/nologin tunefl
# SYNC: }

Next, we install the additional packages needed for the worker service. The lilypond package has lots of dependencies, and installation takes a long time. This fragment is placed so as to maximise the shared layers between Dockerfile and Dockerfile.worker. However, web does not need lilypond, so we keep that build clean, which is the whole motivation for the separate Dockerfile.worker.

# Dockerfile.worker (continued)

RUN \
    apt-get update -y && \
    apt-get install -y \
        lilypond

Finally, we keep the SYNC: Dockerfile/2 section in sync with that in Dockerfile, simply to reuse the shared pattern. We don’t benefit from the Docker layer cache across Dockerfiles here, as the previous steps have already diverged. Note that we don’t need to set up any directories or install any gems, as we’re reusing the web service volumes. This means that gem installation should still be fast, because the work has already been done in Dockerfile.

# Dockerfile.worker (continued)

# SYNC: Dockerfile/2 {
WORKDIR /srv/tunefl/
USER tunefl
ENV BUNDLE_APP_CONFIG /srv/tunefl.bundle/
# SYNC: }

usage

In this section, I note some common usage commands. I’m using Docker Compose 1.5.2, which means the --x-networking flag is required to activate the automatic handling of a segregated project-specific bridge network. If you’re using Docker Compose 1.6 onwards, you probably won’t need this flag.

To build and start all services, both internal and external:

docker-compose --x-networking up

To view all running containers and their names:

docker ps

To open a browser pointed to whichever ephemeral port Docker has connected the web service to (you can also map to host ports explicitly, if you prefer):

xdg-open "http://$(docker-compose port web 8080)" # Linux

To connect to PostgreSQL:

docker exec -it tunefl_postgres_1 psql -U postgres

To connect to Redis:

docker exec -it tunefl_redis_1 redis-cli

To migrate a database using rake db:migrate:

docker exec tunefl_web_1 bundle exec rake db:migrate

To monitor Resque or Sidekiq jobs from the command-line using my Sidekiq Spy:

docker exec -it tunefl_worker_1 sh \
    -c 'TERM=xterm bundle exec sidekiq-spy -h tunefl_redis_1.tunefl -n resque'

More usage notes, including an approach to handling multiple stacks such as for executing an automated test suite, can be found in the Tunefl README.
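
As a rough sketch of the multi-stack idea (hypothetical commands, not the exact approach documented in the README), assuming a second Compose file docker-compose.test.yml whose env_file points at .test.env, a test stack can be run under its own project name so that its containers and network stay segregated from the dev stack:

cp .env.example .test.env    # then adjust credentials and hostnames for the test stack

docker-compose -f docker-compose.test.yml -p tunefltest --x-networking up -d

docker-compose -f docker-compose.test.yml -p tunefltest run web \
    bundle exec rake test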

Peace,
tiredpixel

Code related to this post is available from the Tunefl repository.

rethinking analytics and social media sharing

I find myself continuing to rethink the placement of analytics and social media sharing scripts on the sites I control. Over recent years, this behaviour has become almost automatic for me: create a site or web application, add Google Analytics scripts, add AddThis buttons to facilitate sharing on Twitter and other channels. But I’m realising more and more that we’ve somehow got into a situation where privacy-related decisions are often taken with far too little time for reflection, sometimes supported by the all-too-weak argument that everyone else is doing it. For some parties, I suspect we might have got into this position by design, but for others, it might be more a case of not having adequately thought about it, of not having fully considered what we might be giving up as well as gaining.

In November 2015, I removed Google Analytics and AddThis from Tunefl, my open-source LilyPond mini-score engraving and sharing service for musicians. I also began a more serious review of other sites I’m involved with. Today, I also removed the Twitter embedded-tweet scripts from this blog. Sure, it means that my shared photos don’t auto-display, and that it’s more work for someone to share any content they find interesting (!). But more important to me is that the site is more respectful of users who run JavaScript, and that privacy tools such as Privacy Badger indicate that my site doesn’t appear to be participating in tracking-related activities. And it’s also faster, because of avoiding the need for the remote JavaScript load.

Regarding analytics, the removal of Google Analytics from my site isn’t to say that data isn’t important to me. High-quality data can lead to high-quality decisions (although there is no guarantee of this). But with data collection comes great responsibility. I haven’t had the chance to fully review the previous solution. And it’s not even the complete picture anyway, missing not only many non-browser visits such as by web crawlers, but also visits from users who have JavaScript disabled or have privacy-related extensions activated (as often do I). To get a fuller picture of the data, I could independently parse the web server logs (ignoring here the potential privacy implications of keeping those logs in the first place). I’m not actually doing that at the moment, but it’s certainly something I’ll be thinking more about.

Some might think it would have been better to have kept things as they were until I’d spent more time reviewing the situation and getting alternative solutions in place. But I think it’s often best to set out in the approximate direction you realise you want to go, even without having the whole route planned. Otherwise, it’s far too easy to stay exactly where you are, and keep putting changes off until another time. For a client project, I might not necessarily advise the same approach; that would need careful consideration (and wouldn’t just be my decision, obviously). But really, in the bigger picture, my blog and open-source sites really aren’t that important—not enough to justify doing something I’m uncomfortable with. For the meanwhile, if that means I have less insight into visitors, I think I’m actually okay with that.

I’m not saying I won’t continue to use such analytics and social media sharing scripts in the future, even for sites I don’t have to compromise for, but that at least I would have given myself more time to consider the options carefully before doing so, and, hopefully, to find a solution which yields benefits without potentially contributing negatively to privacy. I’m not necessarily recommending that others follow suit; I think what’s most important here is to think carefully about these things rather than sleepwalk into them, and, if you have the opportunity, to err on the side of caution if your present self is not sure what to do.

Peace,
tiredpixel

lightweight database migrations using PostgreSQL

Update 2016-01-26: lightweight database migrations using PostgreSQL 2 details an update to the ideas outlined in this post, fixing various issues and streamlining the overall approach.


How can I run database migrations in a simple manner, with minimal dependencies, and using the database itself rather than a separate utility? To explore this, I use PostgreSQL. Not wanting to use anything particularly exclusive, I opt for PostgreSQL functions using SQL rather than PL/pgSQL. The latter would likely reduce duplication, but I don’t mind a little duplication so long as it’s centralised and gives me something in return. I’m also quite happy with using common utilities like cat and psql.

creating migrations

stage 0: initialisation

We need a way of storing metadata for migrations. We could use a dedicated table, but I opt for a more generic _meta table in case I want to store other metadata directly in the database in a similar format.

-- database/0_init.sql

CREATE TABLE IF NOT EXISTS _meta (
    key     VARCHAR     PRIMARY KEY,
    value   VARCHAR
);

INSERT INTO _meta (key, value) VALUES ('version', 0);

We could run this separately once, or just include it along with the migrations, if we’re not worried about the INSERT failure on subsequent runs. (If you’re running with psql -1, this would be a problem, but I’m specifying transactions explicitly.)

We create some temporary functions, which our migrations will be able to use:

-- database/0_init.sql

CREATE FUNCTION pg_temp.version() RETURNS integer AS $$
    SELECT value::integer FROM _meta WHERE key = 'version';
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.version(integer) RETURNS boolean AS $$
    UPDATE _meta SET value = $1 WHERE key = 'version';
    SELECT TRUE;
$$ LANGUAGE sql;

CREATE FUNCTION pg_temp.migrate(integer) RETURNS boolean AS $$
    SELECT CASE WHEN pg_temp.version() = $1 - 1 THEN NULL ELSE FALSE END;
$$ LANGUAGE sql;

The pg_temp.version() function gets the current migration version; we need to cast the type because of the generic _meta table. The pg_temp.version(integer) function sets a migration version. It returns boolean to be compatible with how we’ll run the migrations, and indicates that the migration actually executed. The pg_temp.migrate(integer) function determines whether a particular migration should be run: it returns NULL if the migration is strictly the next in the sequence (so that COALESCE falls through to run it), and FALSE otherwise. This excludes migrations which have previously been run, whilst protecting against accidentally running migrations in the wrong order.

stage 1: migrations

We create an example migration:

-- database/1_migrate_1.sql

CREATE FUNCTION pg_temp.migrate_1() RETURNS boolean AS $$
    INSERT INTO _meta (key, value) VALUES ('tmp-1', 1);

    SELECT pg_temp.version(1);
$$ LANGUAGE sql;

This is the format that each of our migrations should be in: in its own file if you prefer; numbered consecutively starting from 1; wrapped in a temporary function named with the migration number suffix; returning boolean, with the final statement a SELECT calling the migration version setter. The INSERT INTO statement is merely an example of migration work, here.

stage 2: execution

We need a way of calling the migration functions:

-- database/2_exec.sql

BEGIN;
    SELECT
        pg_temp.version() "version_f",
        COALESCE(pg_temp.migrate(1), pg_temp.migrate_1()) "migrate_1",
        pg_temp.version() "version_t"
    ;
COMMIT;

We wrap this in a transaction, and chain the calls in the correct order. With PostgreSQL, even DDL statements will get rolled back if the transaction fails. The calls to the pg_temp.version() function report the starting and ending migration versions. The COALESCE() handles only running a migration if it should be run next and hasn’t been run already, using lazy-evaluation of parameters.

We make a small script to ensure we don’t forget which order to call the files in, returning the entire SQL code to be executed:

# database.sql.sh

#!/bin/sh

cat \
        database/0_init.sql \
        database/1_migrate_*.sql \
        database/2_exec.sql

This uses cat to concatenate the files in the correct order of stages: 0, 1, 2. It doesn’t matter in which order the individual migrations in stage 1 are concatenated, as they only define functions; it’s only in stage 2 that they’re conditionally executed.

running migrations

Finally, we run the migrations against the database, using psql -x to transpose the output to list one line per migration:

database.sql.sh | psql -x
CREATE TABLE
INSERT 0 1
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
BEGIN
-[ RECORD 1 ]
version_f | 0
migrate_1 | t
version_t | 1

COMMIT

The CREATE TABLE and INSERT lines indicate that the _meta table was initialised. The CREATE FUNCTION lines indicate that our migrations have been loaded. The RECORD 1 shows that the migration version started at 0, ran migrate_1, and ended at 1.

If we run the migrations again:

database.sql.sh | psql -x
CREATE TABLE
ERROR:  duplicate key value violates unique constraint "_meta_pkey"
DETAIL:  Key (key)=(version) already exists.
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
BEGIN
-[ RECORD 1 ]
version_f | 1
migrate_1 | f
version_t | 1

COMMIT

The _meta table initialisation gives an error, but we can ignore that. The RECORD 1 shows that the migration version started at 1, didn’t run migrate_1, and ended at 1.

concluding thoughts

I like the simplicity of this approach, with SQL files able to be used without being wrapped in another language, and requiring only cat and psql as dependencies. I don’t like the repetition in database/2_exec.sql, although it does have its advantages in being able to define a custom name for each migration to display in the final report. Alternatively, a more complex function using PL/pgSQL could likely be used, wrapping the calls in a loop. The use of side-effects when calling via SELECT is rather hacky; perhaps I’d prefer it split into multiple statements. That would also make me more comfortable about evaluation-order guarantees of parameters; COALESCE provides the necessary assurances, but I’m not sure about SELECT. We could wrap a single transaction around the outside of everything using psql -1, but I’d be worried that I’d forget to type that one day and run migrations outside of a transaction by mistake.
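
As a rough sketch of that PL/pgSQL alternative (untested here, and assuming the per-migration functions keep their pg_temp.migrate_N() names), a wrapper could loop up to a given maximum version, calling each migration only when it is strictly the next in the sequence:

-- hypothetical alternative to database/2_exec.sql (sketch only)

CREATE FUNCTION pg_temp.migrate_all(max_version integer) RETURNS void AS $$
BEGIN
    FOR i IN 1..max_version LOOP
        -- only run a migration if it is strictly the next in the sequence
        IF pg_temp.version() = i - 1 THEN
            EXECUTE format('SELECT pg_temp.migrate_%s()', i);
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

BEGIN;
    SELECT pg_temp.version() "version_f";
    SELECT pg_temp.migrate_all(1);  -- e.g. with a single migration defined
    SELECT pg_temp.version() "version_t";
COMMIT;

This would remove the repetition, though at the cost of the per-migration report columns noted above.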

The code in this post is also available in my coding notebook.