I've written about the history of our Release Automation systems in the past. We've gone from mostly manual releases to almost completely automated since I joined Mozilla. One thing I haven't talked about before is Ship It - our web tool for kicking off releases:
It may be ugly, but having it has meant that we don't have to log on to a single machine to ship a release. A release engineer doesn't even need to be around to start the release process - Release Management has direct access to Ship It to do it themselves. We're only needed to push releases live, and that's something we'd like to fix as well. We're looking at tackling that and other ancillary issues of releases, such as:
Rail and I had a brainstorming session about this yesterday and a theme that kept coming up was that most of the things we want to improve are on the edges of release automation: they happen either before the current automation starts, or after the current automation ends. Everything in this list also needs someone to decide that it needs to happen -- our automation can't make the decision about what revision a release should be built with or when to push it to Google Play - it only knows how to do those things after being told that it should. These points where we jump back and forth between humans and automation are a big rough edge for us right now. The way they're implemented currently is very situation-specific, which means that adding new points of human-automation interaction is slow and full of uncertainty. This is something we need to fix in order to continue to ship as fast and effectively as we do.
We think we've come up a new design that will enable us to deal with all of the current human-automation interactions and any that come up in the future. It consists of three key components:
A workflow is a DAG that represents an entire release process. It consists of human steps, automation steps, and potentially other types. An important point about workflows is that they aren't necessarily the same for every release. A Firefox Beta's workflow is different than a Fennec Beta or Firefox Release. The workflow for a Firefox Beta today may look very different than for one a few months from now. The details of a workflow are explicitly not baked into the system - they are part of the data that feeds it. Each node in the DAG will have upstreams, downstreams, and perhaps a list of notifications. The tooling around the workflow will respond to changes in state of each node and determine what can happen next. Much of each workflow will end up being the existing graph of Buildbot builders (eg: this graph of Firefox Beta jobs).
We're hoping to use existing software for this part. We've looked at Amazon's Simple Workflow Service already, but it doesn't support any dependencies between nodes, so we're not sure if it's going to fit the bill. We're also looking at Taskcluster which does do dependency management. If anyone knows of anything else that might be useful here please let know!
As well as continuing to provide a human interface, Ship It will be the API between the workflow tool and humans/automation. When new nodes become ready it makes that information available to automation, or gives humans the option to enact them (depending on node type). It also receives state changes of nodes from automation (eg, build completion events). Ship It may also be given the responsibility of enforcing user ACLs.
Release Runner is the binding between Ship It and the backend parts of the automation. When Ship It is showing automation events ready to start, it will poke the right systems to make them go. When those jobs complete, it will send that information back to Ship It.
This will likely be getting a better name.
This design still needs some more thought and review, but we're very excited to be moving towards a world where humans and machines can integrate more seamlessly to get you the latest Firefox hotness more quickly and securely.