CRAB Cardano stake pool: botched KES key rotation incident report

2020-10-06 · computing

I’m sorry to report that there has been an incident with the CRAB Cardano stake pool: when rotating the KES key a few days ago, I did it incorrectly, and as a result 4 blocks from the current Epoch 221 were missed. Here follows an incident report, explaining the context, what happened, how it was discovered, the effects, and the remediations put in place to decrease the chance of this happening in the future. I’m publishing this information both in the interests of transparency to members of the CRAB stake pool, and to help other Cardano stake pool operators avoid making similar mistakes.

context

A Cardano stake pool has many keys, certificates, and addresses: payment key pair, stake key pair, payment address, stake address, cold key pair, cold counter, VRF key pair, KES key pair, operational certificate, delegation certificate, and registration certificate—as well as related metadata which must also be checksummed. All this must be configured perfectly in order for the stake pool to have a chance of making blocks—and none of this even takes into account other things such as software versions, networking topology, and firewalls. It’s an extremely complex (and convoluted?) process—that’s not just my opinion; Staking Rewards lists running a Cardano stake pool as ‘hard’ complexity with ‘moderate’ risk. The KES key pair operates as a ‘hot key’, and needs to be rotated every 90 days; this is a condition of operating a stake pool on the Cardano blockchain.

what happened

I have monitoring which tracks, amongst other things, the expiration of the KES key. On 1 Oct, with only a few days left before it became invalid, I did the maintenance work to rotate it. This was the first time I’d done this since the Shelley era launch, so although I was able to lean on various helpful posts online, I didn’t yet have my own set of notes specific to my infrastructure. Rotating the KES key requires: generating a new KES key pair, calculating the start KES period, generating a new operational certificate, signing that with the stake pool cold key, deploying the updated KES secret key and node operational certificate to the block-producing node of the stake pool, and bouncing that node to pick up the changes.
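To illustrate the ‘calculating the start KES period’ step, here is a minimal sketch of the arithmetic. The constants are what I understand the mainnet Shelley genesis values to be, and are assumptions to check against your own genesis file; the function names are hypothetical and not part of any Cardano tooling.

```python
# Assumed mainnet Shelley genesis values; verify against your own genesis file.
SLOTS_PER_KES_PERIOD = 129_600  # slots in one KES period
MAX_KES_EVOLUTIONS = 62         # KES periods for which one KES key stays valid
SLOT_LENGTH_SECONDS = 1         # seconds per slot


def kes_period(slot: int) -> int:
    """Return the KES period containing the given absolute slot number."""
    return slot // SLOTS_PER_KES_PERIOD


def kes_validity_days() -> float:
    """Return how many days a freshly issued KES key remains usable."""
    total_seconds = MAX_KES_EVOLUTIONS * SLOTS_PER_KES_PERIOD * SLOT_LENGTH_SECONDS
    return total_seconds / 86_400


def periods_remaining(current_slot: int, cert_start_period: int) -> int:
    """Return the KES periods left before a certificate issued at
    cert_start_period expires (0 once it has expired)."""
    expiry_period = cert_start_period + MAX_KES_EVOLUTIONS
    return max(0, expiry_period - kes_period(current_slot))
```

Under these assumed values a key lasts about 93 days, in line with the roughly 90-day rotation described above. The start period for a new operational certificate would be `kes_period(current_tip_slot)`, where the tip slot is read from a synced node.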

Since this is a cryptocurrency financial system, I keep all cold keys offline elsewhere for security, and deploy only the minimum necessary files to the stake pool: KES secret key (kes.skey), operational certificate (node.cert), VRF secret key (vrf.skey), and network topology (these files can be named differently, but these are the defaults in my open-source Cardano Node Docker images). After generating the new KES key pair and operational certificate, I deployed the files to the stake pool and bounced the block-producing node. Or so I thought! In fact, I had failed to transfer the new KES secret key, meaning that the stake pool was running the new operational certificate with the old KES secret key.

how it was discovered

After updating the stake pool, I watched carefully to check that everything restarted correctly. The block-producing node booted without any errors reported, quickly caught up on the previous few minutes’ activity on the blockchain, and continued as normal to output information about other slot leaders on the network. My monitoring also updated, showing that the new KES key expiration was detected, and that there were now around 90 days left for the new key. Since CRAB has historically been a small pool (having grown much bigger only in this same epoch), blocks have typically been made only every few days; sometimes, only every couple of weeks. Despite the infrequency, the pool has typically been very profitable (for delegators, at least; not for me as a stake pool operator, since the rewards are minuscule, and the pool is running at a significant loss).

One of the problems with making blocks so infrequently, however, is that it is perfectly possible (indeed, mathematically probable) that for some epochs, there might be no blocks made. Thus, when Epoch 220 finished on 2 Oct and there were no blocks, I was a little disappointed, but not concerned; after all, there had been 1 block in each of Epochs 217 and 218, which was moderately lucky. For the current Epoch 221, however, CRAB went live with much higher active stake, which for the first time made the expected blocks per epoch over 1 (6.1, in fact). I was surprised when no blocks were made on 3 Oct, but again, this is perfectly possible mathematically; after all, these are still somewhat small numbers.
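To put rough numbers on ‘perfectly possible mathematically’, here is a small sketch that estimates the chance of seeing no blocks over a given number of days. It assumes block production can be approximated as a Poisson process spread evenly over a 5-day epoch; that is a simplification of the real leader election, and the function name is hypothetical.

```python
import math

EPOCH_DAYS = 5  # assumption: one mainnet epoch lasts 5 days


def prob_no_blocks(expected_per_epoch: float, days: float) -> float:
    """Poisson estimate of the probability of making zero blocks in `days`,
    given the expected number of blocks for a whole epoch."""
    expected_in_window = expected_per_epoch * days / EPOCH_DAYS
    return math.exp(-expected_in_window)


# Chance of a blockless stretch with 6.1 expected blocks per epoch.
for days in (1, 2, 3, 5):
    print(f"{days} day(s) without a block: {prob_no_blocks(6.1, days):.1%}")
```

Under these assumptions, a single blockless day has a chance of roughly 30%, falling to about 9% for two days and under 3% for three, which is roughly the point where bad luck stops being a comfortable explanation.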

When there was still nothing on 4 Oct, however, I began to be suspicious. I spent some time double-checking the stake pool, and verifying that the monitoring was working correctly. The block-producing node was still running without errors, was up-to-date with the correct block height, everything was connected properly to the relay nodes which also had the correct block height, and there were plenty of connections established in and out in the Cardano network to other relays and stake pools. After some time examining it, I came to the conclusion that we were just being rather unlucky, just as we had been exceedingly lucky in Epoch 213 (3000%!).

When it got to 5 Oct and there were still no blocks, my suspicion had grown so much that I went back and revisited everything from scratch. I examined the config files loaded, and compared checksums between the offline and online copies. I restored point-in-time backups and compared those checksums again. I went through the notes I’d made when rotating the key, examining every line and comparing to what I expected in the stake pool. And then I suddenly realised that I’d completely missed uploading the new KES secret key when I deployed the new operational certificate!
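The checksum comparison described above can be sketched as follows. This is a minimal illustration, not my actual tooling: it assumes the offline originals and a fetched copy of the deployed files are both readable as local directories, and the function names and default file list are hypothetical (the file names are the defaults mentioned earlier).

```python
import hashlib
from pathlib import Path

# File names as mentioned above; adjust if yours are named differently.
DEPLOYED_FILES = ("kes.skey", "node.cert", "vrf.skey")


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_mismatches(offline_dir: Path, deployed_dir: Path,
                    names=DEPLOYED_FILES) -> list[str]:
    """Return the names of files that are missing on either side or whose
    checksums differ between the offline and deployed copies."""
    mismatched = []
    for name in names:
        src, dst = offline_dir / name, deployed_dir / name
        if not src.exists() or not dst.exists() or sha256_of(src) != sha256_of(dst):
            mismatched.append(name)
    return mismatched
```

Run straight after a deployment, a check like this would have reported kes.skey as mismatched, since the node still held the old key.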

Despite having monitoring, I hadn’t specifically flagged the messages which indicated that the KES key was invalid, and that the blocks being made were being rejected. This is an oversight, but equally, those messages appeared for just a few seconds within a period of days and were never repeated, and the node carried on without reporting further issues. For reference, those error messages look something like:

Forged invalid block in slot <SLOT NO>, reason: ValidationError (ExtValidationErrorHeader (HeaderProtocolError (HardForkValidationErrFromEra S (Z (WrapValidationErr {unwrapValidationErr = ChainTransitionError [OverlayFailure (OcertFailure (InvalidKesSignatureOCERT 78 76 2 "Reject"))]})))))
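A minimal sketch of flagging such lines, assuming the node’s log output can be read line by line as text. The pattern is taken from the message above; the function name is hypothetical, and how alerts are raised is left to whatever monitoring is in use.

```python
import re
from typing import Iterable, Iterator

# Substrings taken from the error message above; either one triggers a flag.
ALERT_PATTERN = re.compile(r"Forged invalid block|InvalidKesSignatureOCERT")


def flag_invalid_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield every log line that reports a rejected (invalid) forged block."""
    for line in lines:
        if ALERT_PATTERN.search(line):
            yield line.rstrip("\n")


# Example with placeholder log lines.
sample = [
    "ordinary log line about another slot leader\n",
    "Forged invalid block in slot <SLOT NO>, reason: ValidationError (...)\n",
]
for hit in flag_invalid_blocks(sample):
    print("ALERT:", hit)
```

Because the message appears only once per missed block and is never repeated, a check like this needs to cover the whole log stream rather than sample it periodically.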

effects

The mistake was made on 1 Oct, and was rectified early on 5 Oct. That period spanned two epochs. A post-mortem indicated that in Epoch 220, no blocks were missed. Over the weekend in the current Epoch 221, however, 3 blocks were missed on Sat 3 Oct, and 1 block was missed on Sun 4 Oct. 1 block was subsequently made correctly on 5 Oct after the fix, and is verifiable in the usual blockchain explorers. Given the nature of cryptocurrency blockchains, there is nothing that can be done about the 4 lost blocks, and any rewards from those are lost.

remediations

As the software stands, it’s actually pretty hard to spot this type of problem, and very probably not possible at all to spot it until blocks are missed. Ideally, I would have detected the error on the first missed block. Not having experienced such an issue before, however, and not knowing what to look for, it took me a few days to detect. It’s worth noting that with no leadership election schedule being published, it’s seemingly not possible to detect in advance when the stake pool should actually be making blocks, compared to when it’s expected to sit there indefinitely awaiting such work. Given that I’ve identified the specific log lines in the post-mortem, however, I’ve now extended the monitoring to flag such messages in future. Since I’m also much more mindful of this issue having now experienced it, and having made my own specific notes for KES key rotation to use next time, the mistake is much less likely to occur in future.