let’s stop embedding config in repositories

2013-08-01 · Computing

I’m not the first person to say it, and I certainly won’t be the last. And yes, I’m sometimes guilty of it myself. I write a program—probably not going to be open-sourced, for internal use only, and not a library. So, I embed a port number (MySQL really isn’t going to be anywhere other than port 3306, right?). That file gets committed, and other developers on my team know what the connections and passwords are. That’s fine, right? Okay, as many of us know, it’s sort of ‘bad’. But why? Here are some reasons.

embedding config can make code harder to read

Whether the code is familiar or not, mixing core logic with specific implementation can make the code much harder to read. For example: Is this number something specific which must be kept the same for the program to work, or is it something which depends on an external factor? Where can the port for this service be changed? Oh, it’s somewhere buried within this file. By ‘config’, I don’t just mean passwords and ports; extracting other settings (e.g. polling intervals) can make the important parts of the code more obvious. At the very least, extracting such settings to a separate file within the repository is a step in the right direction. But I think it’s better not to commit those files at all.
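To make the readability point concrete, here is a minimal sketch in Python (my choice for illustration; the idea is language-agnostic). The `Settings` class, its field names, and the two helper functions are all hypothetical. The sketch assumes that everything installation-specific is gathered into one object and passed to the core logic, so no port or interval is buried in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Values that depend on the installation, not on the program's logic.

    All field names here are hypothetical, chosen only for this example.
    """
    db_host: str
    db_port: int
    poll_interval_seconds: float


def describe_connection(settings):
    """Core logic: build a connection string from whatever it is given."""
    return f"mysql://{settings.db_host}:{settings.db_port}"


def next_poll_time(last_poll, settings):
    """Core logic: schedule the next poll without knowing the interval."""
    return last_poll + settings.poll_interval_seconds
```

With this shape, a number that appears inside the logic is there because the program needs it, and anything an installation might change lives in `Settings`.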

embedding config makes it hard to deploy multiple installations

Perhaps I’ve written some internal program, but then find this could also be used by another team. If I’ve embedded config, maintaining separate installations is troublesome. Cloning the repository pulls down all sorts of settings which are not relevant, so I have to fork to change or remove irrelevant config (assuming, of course, the credentials can be shared across teams; if not, then embedded config prevents the repository from being shared at all). With the config extracted and not committed, however, deploying a new installation is just as easy as deploying the existing one. It’s not the configuration which is of interest; it’s the core logic itself.

A similar argument applies to deployer configuration. If you’re writing Ruby, for instance, perhaps you want to use Capistrano or Mina, and use a config/deploy.rb. That’s well and good, but don’t commit it! If you do, you’ve immediately made it far harder for anyone else to use the repository. Both these deployers support using arbitrary locations for deployment configuration, anyway; there really is no reason to store it within the repository.

development is just another installation

When developing, I like to run as close to production as possible; instead of seeding, I prefer to use dumps of live databases. This gives me greater assurance that the code I’m writing is going to work on another machine. But what about optimisations like caching? True, you don’t usually want that enabled in development. Less obvious is that you might not want it in production, either.

Thanks to popular frameworks like Ruby on Rails, we’ve got into this habit of thinking about development, production, and maybe testing environments. But what if I want two productions: one tracking a develop branch, and another tracking a master branch for something customer-facing (maybe set up for automatic deployment on push)? I probably want caching on the customer-facing version, but I might not on the internal develop branch. But setting whether or not to apply caching really has nothing to do with the core logic itself; you might even like to enable caching to test something in development, without enabling the rest of the features usually used in production. I’ve seen a lot of code that checks RACK_ENV or RAILS_ENV, but such embedded decisions don’t scale well to multiple environments, and also make it hard to use logical names such as develop, user-test, live-beta, live. Having separate configs makes this straightforward.
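As a rough illustration of a per-setting switch in place of an environment-name check, here is a short Python sketch. The `ENABLE_CACHING` variable, the accepted truthy values, and both function names are assumptions made up for this example; they are not conventions of Rails, Rack, or any other framework.

```python
import os

# Hypothetical convention: these strings count as "on".
_TRUE_VALUES = {"1", "true", "yes", "on"}


def flag_enabled(name, env=None, default=False):
    """Return True when the named setting is switched on for this installation."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def fetch_report(compute, cache, env=None):
    """Return compute()'s result, using the cache only if ENABLE_CACHING is set.

    Nothing here asks which environment it is running in; each installation
    (develop, user-test, live-beta, live) decides for itself.
    """
    if not flag_enabled("ENABLE_CACHING", env):
        return compute()
    if "report" not in cache:
        cache["report"] = compute()
    return cache["report"]
```

Because the decision hangs on one setting, a development machine can turn caching on to test it without pretending to be production.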

but storing config in a repository is convenient

So long as you’ve got tight control over who can read and write to that repository, there’s no problem with storing config in a repository. But make it a different repository—one that you know has nothing to do with the code, and doesn’t pollute it with assumptions about config or deployers. Maybe you want to have a config repository per code repository. Maybe you want to keep all credentials for your programs in a shared repository, similar to tracking Puppet manifests. Both approaches work; the key is to keep core logic separate from specific implementation. That way, you get all the benefits, as well as the possibility of quickly changing the deployment method across multiple repositories and installations.

an approach

Currently, I tend to favour using environment variables where possible (such as can be easily read by Foreman), and JSON config files where not. I provide example config files (which I do commit), describe in the documentation which files need to be copied and modified, and ignore the target locations. Thus, setting up a new program is quick, and I can see all settings at a glance. When coding, I think: Could this repository be made public without jeopardising a live installation (business considerations ignored)? If not, then that’s probably a hint that config could be extracted. I keep the deployment configurations entirely separate to the code repositories; I disfavour the config/deploy.rb pattern.
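The approach above might look something like the following in Python. This is a sketch under assumptions, not a description of any particular tool: the function names, and the idea that the committed example sits next to the real file with an `.example` suffix, are hypothetical choices for illustration.

```python
import json
import os
from pathlib import Path


def load_json_config(path):
    """Read the uncommitted JSON config, pointing at the example if it is missing.

    Assumes (hypothetically) that a committed '<path>.example' file exists.
    """
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"{config_path} not found; copy {config_path}.example "
            f"to {config_path} and edit it"
        )
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def get_setting(name, file_config, env=None):
    """Prefer an environment variable, then fall back to the JSON config."""
    env = os.environ if env is None else env
    if name in env:
        return env[name]
    if name in file_config:
        return file_config[name]
    raise KeyError(f"setting {name!r} is not configured")
```

One caveat with this sketch: values from environment variables arrive as strings, while values from JSON keep their types, so the caller has to convert where it matters.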

How about you—what’s your opinion about embedding config in repositories?