As the release manager for deephaven-core, I have the pleasure of kicking off our monthly release process. It's a mostly automated affair via GitHub Actions with a few manual steps along the way. Things usually go smoothly...
I got lazy.
A pull request (PR) landed on me for review. I'm a fan of fast PR turnaround times, especially for small, non-controversial dependency version bumps. I was away from my computer, but saw the PR on the GitHub app on my phone. A single dependency version bump, going from WEB_VERSION=0.15.1 -> WEB_VERSION=0.15.2 - easy: clicks Approved.
As it turns out, this was not so easy. This is the story of how I accidentally overwrote our product's Docker images, then rescued our release using the crane tool.
Problem
A little bit later I was notified that the release process had failed! What release!? I looked back at the PR and noticed it was targeting a release branch instead of main, something I had completely missed during the review process. The automated release process had been kicked off with our old release branch.
Luckily, Maven and PyPI enforce unique versions: once you release version x.y.z, you can't overwrite version x.y.z. So, from this perspective, we were safe. But Docker tags (you've probably used Docker latest tags before) don't work like this... and unfortunately, we had just overwritten our 0.15.0 Docker release tags.
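To make the tag-versus-digest distinction concrete: a tag is a pointer that can be moved, while a digest identifies one exact image. Here is a minimal sketch of a check that notices when a tag has moved. It assumes crane is installed and can read the registry; the function name tag_matches is hypothetical and not part of our release tooling.

```shell
# Sketch: detect when a tag no longer points at the digest recorded at release time.
# Assumption: crane is on the PATH and has read access to the registry.

# tag_matches IMAGE_TAG EXPECTED_DIGEST
# Returns 0 when the tag resolves to the expected digest, 1 when it has moved,
# and 2 when the lookup itself fails.
tag_matches() {
  # `crane digest` prints the digest that a reference currently resolves to.
  current="$(crane digest "$1")" || return 2
  if [ "$current" != "$2" ]; then
    echo "$1 now resolves to $current, expected $2" >&2
    return 1
  fi
}
```

Called as tag_matches ghcr.io/deephaven/server:0.15.0 followed by the digest recorded at release time, it would have reported the overwrite described here.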
Time to fix this mistake. One reasonable option would be to revert (or hard-reset) the release branch and re-re-kick off the release process. The Docker images would be re-built, re-tagged, and re-published, but they wouldn't be bit-for-bit identical to our original release.
Another option would be to re-tag and re-push them to the repository myself, but that means I would need to have all the release images local to my machine. I didn't already have the images locally, but I could get the SHA-digest from our previous release logs. Unfortunately, the size of the images and my slow internet would make this a lengthy process. (This option might also be tricky with multi-architecture images, unless great care is taken.)
Solution
Enter crane, "a tool for interacting with remote images and registries" (not to be confused with another Docker tool by the same name). It's an impressive command-line interface with a lot of subcommands. We're interested in crane tag, "efficiently tag a remote image".
We use the docker/build-push-action as part of the build process. The logs here are very helpful: scrolling to the end of the action, we can see the digest is explicitly printed out.
Next I executed:
crane tag ghcr.io/deephaven/server@sha256:ea6ca1c9b758f33b164924296a41d009a0fa3d3e10c8ac3c361b7e4e6d0f70b8 0.15.0
A few seconds later, the repository tags were once again pointing at the correct release image. I should note that I was lucky the GitHub Container Registry still had the original images. I'm not sure what process GitHub uses for deleting untagged images, but I'm glad we were quick enough to not find them deleted.
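With more than one image to restore, the same command can be wrapped in a small helper. The following is a sketch rather than what was actually run: the retag function and the DRY_RUN variable are hypothetical, and each digest would have to be copied from the corresponding release log. By default it only prints the crane command so it can be reviewed first.

```shell
# Sketch: re-point a release tag at a known-good digest.
# Assumptions: crane is on the PATH and logged in with push access.
# `retag` and DRY_RUN are illustrative names, not part of crane.

# retag REPO DIGEST TAG
retag() {
  repo="$1"; digest="$2"; tag="$3"
  hex="${digest#sha256:}"
  # Reject anything that is not "sha256:" followed by 64 lowercase hex characters.
  case "$hex" in
    *[!0-9a-f]*) hex="" ;;
  esac
  if [ "$digest" = "$hex" ] || [ "${#hex}" -ne 64 ]; then
    echo "invalid digest for $repo: $digest" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: print the command instead of touching the registry.
    echo "crane tag ${repo}@${digest} ${tag}"
  else
    crane tag "${repo}@${digest}" "${tag}"
  fi
}
```

Running it with DRY_RUN=0 performs the actual re-tag. Validating the digest up front guards against a truncated copy-paste from the logs.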
Conclusion
There are some lessons to be learned here. Of course, the conditions and checks around the release process should be improved: "Has this version number already been released? If yes, stop the release process." On a personal note, I'm also going to be much more careful with "simple" PRs, and will never forget to check the target branch again.
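The "already released?" check could look something like the sketch below. It is one possible shape under stated assumptions, not the check we ended up with: it assumes crane is available in the release job, and the function names tag_exists and guard_release are hypothetical.

```shell
# Sketch: stop a release when its version tag already exists in the registry.
# Assumption: crane is on the PATH and can list the repository's tags.

# tag_exists TAG
# Reads one tag per line on stdin; succeeds only on an exact, whole-line match.
tag_exists() {
  grep -Fxq -- "$1"
}

# guard_release REPO VERSION
# Returns 0 when VERSION is unused, 1 when it is already published,
# and 2 when the tag list could not be fetched (fail closed).
guard_release() {
  # `crane ls` prints the tags in a repository, one per line.
  tags="$(crane ls "$1")" || return 2
  if printf '%s\n' "$tags" | tag_exists "$2"; then
    echo "version $2 already released for $1; stopping" >&2
    return 1
  fi
}
```

A release job would call guard_release ghcr.io/deephaven/server with the version being released before building anything, and abort on a non-zero status. The whole-line match matters: a plain substring search would treat 0.15 as already released just because 0.15.0 exists.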