planning for catastrophic failures in unattended upgrades

2020-08-31 · Computing

In the past month, I’ve had a number of unattended upgrades fail, corrupting packages, taking servers offline, and even rendering them unbootable. Despite this, services have remained online, and I’ve been able to rebuild and bring up replacement servers without downtime. Automatic updates are problematic: bugs, incompatibilities, or even just bad luck can bring an entire system crashing down, making an automated solution less attractive. But it doesn’t have to be like this; you can plan for catastrophic failures in unattended upgrades, and mitigate the risk by design.

manual vs automatic patching: a short-term vs long-term conflict

Even with the best of intentions and sufficient budgets, organisations can easily get into a situation where updates are not applied frequently, exposing their infrastructure to a huge amount of risk. Taking humans out of the loop and forging a path forward using unattended upgrades is a way to break out of the problem. But putting aside the cost of developing and testing a reliable automated patching system, this brings whole new risks which, if not properly mitigated, can jeopardise the stability of even a system designed according to high-availability principles. I think that beneath the surface, this is a conflict between short-term and long-term goals.

It’s easy to criticise the failure to maintain a regular patching schedule; I’ve done so myself a number of times in the past, where I felt patches (especially security patches) were applied so infrequently that it could not easily be justified, regardless of the valid reasons leading to that state of affairs. But this can overlook the fact that optimising for stability over the short term often means not applying updates regularly, or at best only doing so manually. Over the long term, however, this approach causes technical debt and risk to grow alarmingly. For regulated industries such as FinTech, this might even put the organisation at risk of penalties.

robots to the rescue

One way to try to balance these conflicting interests is to patch little and often. But unless you have a large team and resources dedicated just to this, finding enough time regularly can be a real struggle, and it’s easy for things to start falling behind. There is also the human element: who wants to be the engineer responsible for applying a broken patch, or for taking a service down unexpectedly at a critical time? In an effort to avoid such risks, some organisations create strictly-enforced maintenance windows, outside which such high-risk changes are not allowed, or perhaps require direct and official sign-off from a manager.

But taking this path is counter-productive, and fails to make use of the experience of the engineers on the ground. Firstly, there might not be enough time to apply all the updates safely within the maintenance period, causing work to roll over, or the maintenance periods might not be scheduled frequently enough. Secondly, having a manager directly control whether and when patches are allowed to be applied results in a lot of decision-making bureaucracy, often giving power over what can be highly technical decisions to people who might not have the relevant experience to make such a judgement call, whilst minimising the input and negating much of the experience of those who do.

For this reason, I’m in general against change advisory boards, except where they operate at a high level over a roadmap spanning months, rather than over specific implementation or execution details. This goes for both programming and infrastructure engineering. I think the solution is to codify the agreed processes and have them executed by robots: impartially, regularly, and pausing the automation only in emergencies. In my infrastructure, updates are applied at all manner of times: when I’m working, overnight when I’m not, during weekdays, during weekends, during service-critical periods, and even during deploys. I would personally rather accept some occasional instability or even outages whilst implementing and ironing out such a system, knowing that every fix and improvement I make decreases the chances of that happening over the long term, whilst ensuring that security risks are vastly reduced. But how to implement such an automatic patching schedule?

approach 1: deterministic randomness

The main idea is to run services in a cluster, and to patch it one node at a time. If possible, clusters should be active-active, since this decreases the proportion of the traffic which is affected, and removes the need for promotion. For active-passive clusters such as a traditional relational database, the failover should be fast and automatic, and all services should be able to reconnect swiftly (and ideally, gracefully). If the node being patched is passive, then of course no failover is needed, although it’s important to take the reduced redundancy into account. If the node being patched is active, then the cluster will need to fail over. I use this as an opportunity for a disaster recovery test, meaning that my infrastructure is frequently conducting a staggered DR test, all on its own!

Whilst it’s possible to use a centralised control method to schedule the patching, I tend to prefer a zero-knowledge approach, where each node derives its own schedule without any central coordination. This decreases the complexity of the solution, and reduces the chance that something will go wrong unrelated to the patching process. When choosing the schedule itself, you should consider not only how long it will take for a cluster to fail over, patch, and come online again, but also how long it will take you to rebuild a node entirely if there is a catastrophic failure.

implementation via Ansible

We start off by generating a random number between 0 and 6, using Ansible’s random filter:

day: "{{ 7 | random }}"

Since the cron module’s day parameter is the day of the month, which starts at 1 rather than 0, we shift the value by 1 to be 1-7, giving each node one day within the first week of the month:

day: "{{ (7 | random) + 1 }}"

We set the random seed based on the hostname, making the value deterministic for each host:

day: "{{ (7 | random(seed=inventory_hostname)) + 1 }}"

Then we salt the seed with the job name, so that other jobs deriving their schedules in the same way land at different times, minimising collisions on the same node:

day: "{{ (7 | random(seed=inventory_hostname + 'upgrade')) + 1 }}"

Putting this all together:

-
  name: CRON upgrade
  cron:
    user: root
    name: upgrade
    day: "{{ (7 | random(seed=inventory_hostname + 'upgrade')) + 1 }}"
    hour: "{{ 24 | random(seed=inventory_hostname + 'upgrade') }}"
    minute: "{{ 60 | random(seed=inventory_hostname + 'upgrade') }}"
    job: systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
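
The cron job calls /usr/local/sbin/upgrade, whose contents aren’t covered here and will depend on your distribution and policies. As a rough sketch only (assuming a Debian-based node; adapt as needed), it might look something like:

#!/bin/sh
# rough sketch (an assumption, not shown in the original setup):
# non-interactive full upgrade, rebooting only when packages require it
set -eu
export DEBIAN_FRONTEND=noninteractive
apt-get -q update
apt-get -qy dist-upgrade
apt-get -qy autoremove
if [ -f /var/run/reboot-required ]; then
    systemctl reboot
fi

Logging via systemd-cat, as in the job line above, means the output ends up in the journal under the cron.upgrade tag.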

so many servers, so little time

Whilst this method is effective, the relatively small number of candidate days means that a collision between node schedules is quite likely: with 3 nodes each choosing from 7 days, the chance that at least 2 share a day is roughly 40%. Deterministically randomising the hour and minute certainly helps, but there is a far bigger problem: what if there is a catastrophic failure, and the server goes offline and becomes unbootable? Unless you have a dedicated, round-the-clock support team, this could easily take the whole service offline. Even if you do have such resources, waking up your infrastructure team in the middle of the night is probably best avoided if at all possible. Don’t forget, either, that a new server could take some time to rebuild and bring online; this applies to both automated and manual node replacements.

approach 2: modular arithmetic

We can improve on the first approach by using modular arithmetic instead. The principle is to divide the period evenly, upgrading every server at roughly equal intervals, like labelling some of the hours of a clock with the node to upgrade. By increasing the size of the clock, you can ensure that one or more days fall in between each upgrade, presuming that you have a small number of nodes in the cluster (such as 3 or 5).

Suppose you want to patch nodes roughly every 14 days, meaning approximately twice a month. You want every node to be upgraded on a different day, and to space these out as much as possible, to give you time to detect and recover from anything serious. Within each of these days, we still randomise the hour and minute of the patching, to prevent an infrastructure-wide stampede, and also to decrease load spikes and failed jobs on the node itself caused by conflicting with other scheduled jobs.

implementation via Ansible

We start off by setting some variables:

package:
  cluster: my_little_cluster
  frequency: 14
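
This assumes that the nodes of each cluster are collected into an inventory group whose name matches package.cluster; a hypothetical YAML inventory for a 3-node cluster might look like:

all:
  children:
    my_little_cluster:
      hosts:
        node-1:
        node-2:
        node-3:

The index() lookup used below then gives each host a stable position (0, 1, 2) within its group.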

For simplicity, let’s run the schedule only over the first 28 days of each month. We start by allocating each node in the cluster to a consecutive day within that 28-day period, repeating the schedule every 14 days, and shifting the index by 1 so that it starts from the 1st day of the month:

day: "{{ (groups[package.cluster].index(inventory_hostname) + 1 }}-28/{{ package.frequency }}"

Next, we account for the possibility that there are more than 14 nodes in a cluster, by using modular arithmetic to wrap around:

day: "{{ (groups[package.cluster].index(inventory_hostname)) % package.frequency + 1 }}-28/{{ package.frequency }}"

However, we don’t want to assign purely consecutive days; instead, we place the nodes at roughly equal spacing within the period, again using modular arithmetic:

day: "{{ ((groups[package.cluster].index(inventory_hostname) * (package.frequency / (groups[package.cluster] | count))) | int) % package.frequency + 1 }}-28/{{ package.frequency }}"

But what if there’s a bug in the upstream updates, for instance to the kernel or bootloader? Since every cluster would follow the same base schedule, multiple nodes would go down on the same day! Admittedly, they would be in different clusters, but if you have lots of servers (I’m currently operating over 80 just in my own infrastructure), this could still cause a lot of work without warning. So finally, we shift the entire schedule for each cluster using deterministic randomness, to make it much less likely that the schedules for different clusters will overlap:

day: "{{ (((groups[package.cluster].index(inventory_hostname) * (package.frequency / (groups[package.cluster] | count))) | int) + (package.frequency | random(seed=package.cluster + 'upgrade'))) % package.frequency + 1 }}-28/{{ package.frequency }}"

Putting this all together:

-
  name: CRON upgrade
  cron:
    user: root
    name: upgrade
    day: "{{ (((groups[package.cluster].index(inventory_hostname) * (package.frequency / (groups[package.cluster] | count))) | int) + (package.frequency | random(seed=package.cluster + 'upgrade'))) % package.frequency + 1 }}-28/{{ package.frequency }}"
    hour: "{{ 24 | random(seed=inventory_hostname + 'upgrade') }}"
    minute: "{{ 60 | random(seed=inventory_hostname + 'upgrade') }}"
    job: systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
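
Once the play has run, it’s easy to check the rendered schedules directly on the nodes; for example, via an ad-hoc command (assuming the hypothetical my_little_cluster group above):

ansible my_little_cluster -b -m command -a 'crontab -l -u root'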

example

Here’s a real example from my own infrastructure, from the Isoxya next-generation crawling system. Amongst other things, Isoxya requires a database cluster (active-passive), a messaging cluster (active-active), and an application cluster (active-active, using a container orchestrator), each containing 3 nodes.

database cluster schedule:

33 16  6-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
18  9 11-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
32 16  2-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade

messaging cluster schedule:

 2  1  5-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
23 11  1-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
19  9 10-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade

application cluster schedule:

32 16  1-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
11  5  6-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade
45 22 11-28/14 * * systemd-cat -t cron.upgrade /usr/local/sbin/upgrade

As you can see, the database and application cluster schedules still overlap. You might simply accept this as an effect of the deterministic-randomness method, or else introduce a variable to shift each cluster’s schedule by a fixed integer (or potentially even derive a shift from the index of the cluster group name within the entire set of groups).
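
As a minimal sketch of the fixed-integer variant (the shift variable is my own naming, not part of the setup above; choose per-cluster values which don’t collide):

package:
  cluster: my_little_cluster
  frequency: 14
  shift: 3

day: "{{ (((groups[package.cluster].index(inventory_hostname) * (package.frequency / (groups[package.cluster] | count))) | int) + package.shift) % package.frequency + 1 }}-28/{{ package.frequency }}"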