Mozilla has been using Buildbot as its continuous integration system for Firefox and Fennec for many years now. It enabled us to switch from a machine-per-build model to a pool-of-slaves model, and greatly aided us in getting to our current scale. But it's not perfect - and we've known for a few years that we'll need to do an overhaul. Lucky for us, the FirefoxOS Automation team has built up a fantastic piece of infrastructure known as Taskcluster that we're eager to start moving to.
It's not going to be a small task though - it will take a lot more work than taking our existing build scripts and running them in Taskcluster. One reason for this is that many of our jobs trigger other jobs, and Buildbot manages those relationships. This means that if we have a build job that triggers a test job, we can't move one without moving the other. We don't want to be forced into moving entire job chains at once, so we need something to help us transition more slowly. Our solution to this is to make it possible to schedule jobs in Taskcluster while still implementing them in Buildbot. Once the scheduling is in Taskcluster it's possible to move individual jobs to Taskcluster one at a time. The software that makes this possible is the Buildbot Bridge.
The Bridge is responsible for synchronizing job state between Taskcluster and Buildbot. Jobs that are requested through Taskcluster will be created in Buildbot by the Bridge. When those jobs complete, the Bridge will update Taskcluster with their status. Let's look at a simple example to see see how the state changes in both systems over the course of a job being submitted and run:
Event | Taskcluster state | Buildbot state |
---|---|---|
Task is created | Task is pending | -- |
Bridge receives "task-pending" event, creates BuildRequest | Task is pending | Build is pending |
Build starts in Buildbot | Task is pending | Build is running |
Bridge receives "build started" event, claims the Task | Task is running | Build is running |
Build completes successfully | Task is running | Build is completed |
Bridge receives "build finished" event, reports success to Taskcluster | Task is resolved | Build is completed |
The details of how this work are a bit more complicated - if you'd like to learn more about that I recommend watching the presentation I did about the Bridge architecture, or just have a read through my slides