
Status update on smaller pools of build machines

Last week glandium and I wrote a bit about how shrinking the size of our build pools would help get results out faster. This week, a few of us are starting work on implementing that. Last week I called these smaller pools “hot tubs”, but we’ve since settled on the name “jacuzzis”.

We had a few discussions about this in the past few days and quickly realized that we don’t have a way of knowing upfront exactly how many machines to allocate to each jacuzzi. This number will vary based on the frequency of the builders in the jacuzzi (periodic, nightly, or on-change) as well as the number of pushes to the builders’ branches. Because of this, we are firmly committed to making these allocations dynamically adjustable. It may take us many attempts to get the allocations right, so we need to make adjusting them as painless as possible. This will also enable us to shift allocations if there’s a sudden change in load (eg, mozilla-inbound is closed for a day, but b2g-inbound gets twice the number of pushes as usual).

There are two major pieces that need to be worked on: writing a jacuzzi allocator service, and making Buildbot respect it.

The allocator service will be a simple JSON API with the following interface. Below, $buildername is a specific type of job (eg “Linux x86-64 mozilla-inbound build”), $machinename is something like “bld-linux64-ec2-023”, and $poolname is something like “bld-linux64”:

  • GET /builders/$buildername – Returns a list of machines that are allowed to perform this type of build.
  • GET /machines/$machinename – Returns a list of builders that this machine is allowed to build for.
  • GET /allocated/$poolname – Returns a list of all machines in the given pool that are allocated to any builder.
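To make the interface concrete, here’s a minimal client sketch for the proposed API. The host name and the exact response shape (a JSON object with a “machines” or “builders” list) are assumptions for illustration, not the final design.

```python
import requests

JACUZZI_URL = "http://jacuzzi-allocator.example.com"  # hypothetical host

def machines_for_builder(buildername):
    """GET /builders/$buildername: machines allowed to run this builder,
    or None if the allocator doesn't know about the builder (404)."""
    r = requests.get("%s/builders/%s" % (JACUZZI_URL, buildername))
    if r.status_code == 404:
        return None
    r.raise_for_status()
    return r.json()["machines"]

def builders_for_machine(machinename):
    """GET /machines/$machinename: builders this machine may build for."""
    r = requests.get("%s/machines/%s" % (JACUZZI_URL, machinename))
    r.raise_for_status()
    return r.json()["builders"]

def allocated_machines(poolname):
    """GET /allocated/$poolname: every machine in the pool that is
    allocated to any builder."""
    r = requests.get("%s/allocated/%s" % (JACUZZI_URL, poolname))
    r.raise_for_status()
    return r.json()["machines"]
```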

Making Buildbot respect the allocator is going to be a little tricky. Buildbot requires an explicit list of machines that can perform jobs for each builder. If we implemented jacuzzis by adjusting these lists in the configuration, we wouldn’t be able to change allocations dynamically. However, we can adjust some runtime code to override that list after talking to the allocator. We also need to make sure that we can fall back gracefully in cases where the allocator is unaware of a builder or is unavailable.

To do this, we’re adding code to the Buildbot masters that will query the jacuzzi server for allocated machines before it starts a pending job. If the jacuzzi server returns a 404 (indicating that it’s unaware of the builder), we’ll get the full list of allocated machines from the /allocated/$poolname endpoint. We can subtract this from the full list of machines in the pool and try to start the job on one of the remaining ones. If the allocator service is unavailable for a long period of time we’ll just choose a machine from the full list.
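As a rough illustration of that selection logic (reusing the hypothetical client functions from the sketch above, and simplifying “unavailable for a long period of time” to an immediate fallback):

```python
def candidate_machines(buildername, poolname, all_machines):
    """Pick the machines a pending job for `buildername` may start on.
    `all_machines` is the full list of machines configured for the pool."""
    try:
        allocated = machines_for_builder(buildername)
    except requests.RequestException:
        # Allocator unavailable: just choose from the full list.
        return all_machines
    if allocated is not None:
        # The builder has a jacuzzi: only its machines may take the job.
        return [m for m in all_machines if m in allocated]
    # Builder unknown to the allocator (404): avoid machines that are
    # allocated to other builders, but never end up with an empty list.
    try:
        taken = set(allocated_machines(poolname))
    except requests.RequestException:
        return all_machines
    return [m for m in all_machines if m not in taken] or all_machines
```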

This implementation has the nice side effect of allowing for a gradual rollout: we can simply leave most builders undefined on the jacuzzi server until we’re comfortable enough to roll it out more widely.

In order to get things going as quickly as possible I’ve implemented the jacuzzi allocator as static files for now, supporting only two builders on the Cedar tree. Catlee is working on writing the Buildbot code described above, and Rail is adjusting a few smaller tools to support this. John Hopkins, Taras, and Glandium were all involved in brainstorming and planning yesterday, too.
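For the static-file version, the layout could be as simple as one JSON file per endpoint path, served by a plain web server. The builder and machine names below are made up for illustration and aren’t the actual Cedar allocations; this is just a sketch of generating such files:

```python
import json
import os

# builder -> machines allowed to run it (hypothetical values)
allocations = {
    "Linux x86-64 cedar build": ["bld-linux64-ec2-001", "bld-linux64-ec2-002"],
    "Linux cedar leak test build": ["bld-linux64-ec2-003"],
}

os.makedirs("builders", exist_ok=True)
for builder, machines in allocations.items():
    # In practice the file/URL name would need to be URL-safe.
    with open(os.path.join("builders", builder), "w") as f:
        json.dump({"machines": machines}, f)

# /allocated/bld-linux64: every machine that appears in any allocation
os.makedirs("allocated", exist_ok=True)
pool = sorted({m for machines in allocations.values() for m in machines})
with open(os.path.join("allocated", "bld-linux64"), "w") as f:
    json.dump({"machines": pool}, f)
```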

We’re hoping to get something working on the Cedar branch in production ASAP, possibly as early as tomorrow. While we fiddle with different allocations there we can also work on implementing the real jacuzzi allocator.

Stay tuned for more updates!

Buildbot Scheduler and Builder graphing

One of the most important systems I work on is the release automation for Firefox and Thunderbird. The process behind the automation long predates me, but I’ve been deeply involved in automating, refining, and optimizing it. It shouldn’t come as any surprise that one of the biggest challenges of working on such a complex system is understanding how the smaller pieces fit together to make up the whole. For the release automation we have an advantage, though: the smaller pieces are generally Buildbot Builders, and the things that tie them together are generally Buildbot Schedulers. A while ago I was improving parallelism for l10n repacks and found it extremely difficult to reason about whether or not my changes would actually create the desired Builders and string them together correctly. I threw together some (terrible) code that spat out a digraph of the release automation’s Builders and Schedulers. By comparing the before and after graphs I was able to iterate on some parts of my code without spending hours and hours testing.

This week I finally got around to tidying up and packaging this code as a more general-purpose tool. It’s not nearly complete and has many rough edges, but as a very basic tool to help you understand non-trivial Buildbot installations, I think it’s wonderful. It’s pip installable (“buildbot-scheduler-graph”) and available on GitHub. Once you’ve got it, try it out with “buildbot-scheduler-graph /path/to/your/master.cfg /path/to/output-dir”. Here’s what Mozilla’s scheduler graphs look like. What do yours look like?
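For the curious, the core idea is small: walk the schedulers defined in a master config and emit a Graphviz edge from each scheduler to every builder it triggers. This isn’t the actual buildbot-scheduler-graph code, just a sketch assuming a standard BuildmasterConfig dict with a “schedulers” list:

```python
def scheduler_edges(buildmaster_config):
    """Yield (scheduler name, builder name) pairs."""
    for scheduler in buildmaster_config["schedulers"]:
        for builder in scheduler.builderNames:
            yield scheduler.name, builder

def to_dot(edges):
    """Render the edges as a Graphviz digraph."""
    lines = ["digraph schedulers {"]
    for scheduler, builder in edges:
        lines.append('    "%s" -> "%s";' % (scheduler, builder))
    lines.append("}")
    return "\n".join(lines)
```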

Which build infrastructure problems do you see the most?

I’m hoping to tackle bug 505512 (Make infrastructure-related problems turn the tree a color other than red) in the next few weeks. Most of the groundwork for it is laid, which means that most of what I’ll be doing is parsing logs for infrastructure errors.
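As a purely hypothetical sketch of what that classification might look like (the patterns below are illustrative examples, not the list that will actually be used):

```python
import re

# Example patterns that smell like infrastructure rather than code problems.
INFRA_PATTERNS = [
    re.compile(r"Connection to the other side was lost"),   # slave disconnect
    re.compile(r"No space left on device"),
    re.compile(r"hg\.mozilla\.org.*(timed out|connection reset)", re.I),
]

def is_infra_failure(log_text):
    """Return True if a build log looks like an infrastructure problem
    rather than a genuine build or test failure."""
    return any(p.search(log_text) for p in INFRA_PATTERNS)
```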

So, what errors do you see most from our build infrastructure? Are there other things that you would classify as infrastructure issues? Please add any suggestions you have to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors