This week in Mozilla RelEng – February 14th, 2014

Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

Status update on smaller pools of build machines

Last week glandium and I wrote a bit about how shrinking the size of our build pools would help get results out faster. This week, a few of us are starting work on implementing that. Last week I called these smaller pools "hot tubs", but we've since settled on the name "jacuzzis".

We had a few discussions about this in the past few days and quickly realized that we don't have a way of knowing upfront exactly how many machines to allocate to each jacuzzi. This number will vary based on the frequency of the builders in the jacuzzi (periodic, nightly, or on-change) as well as the number of pushes to the builders' branches. Because of this we are firmly committed to making these allocations dynamically adjustable. It may take us many attempts to get the allocations right, so we need to make adjusting them as painless as possible. This will also enable us to shift allocations if there's a sudden change in load (eg, mozilla-inbound is closed for a day, but b2g-inbound gets twice as many pushes as usual).

There are two major pieces that need to be worked on: writing a jacuzzi allocator service, and making Buildbot respect it.

The allocator service will be a simple JSON API with the following interface. Below, $buildername is a specific type of job (eg "Linux x86-64 mozilla-inbound build"), $machinename is something like "bld-linux64-ec2-023", and $poolname is something like "bld-linux64":

  • GET /builders/$buildername - Returns a list of machines that are allowed to perform this type of build.
  • GET /machines/$machinename - Returns a list of builders that this machine is allowed to build for.
  • GET /allocated/$poolname - Returns a list of all machines in the given pool that are allocated to any builder.
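To make the interface a little more concrete, here's roughly what requests and responses might look like. Each endpoint just returns a list, but the exact response format, and the specific machine and builder names below, are illustrative rather than final:

    GET /builders/Linux x86-64 mozilla-inbound build
    ["bld-linux64-ec2-023", "bld-linux64-ec2-045", "bld-linux64-ec2-101"]

    GET /machines/bld-linux64-ec2-023
    ["Linux x86-64 mozilla-inbound build"]

    GET /allocated/bld-linux64
    ["bld-linux64-ec2-023", "bld-linux64-ec2-045", "bld-linux64-ec2-101"]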

Making Buildbot respect the allocator is going to be a little tricky. Buildbot requires an explicit list of machines that can perform jobs for each builder. If we implement jacuzzis by adjusting that list, we won't be able to change allocations dynamically. However, we can adjust some runtime code to override that list after talking to the allocator. We also need to make sure that we can fall back in cases where the allocator is unaware of a builder, or is unavailable entirely.

To do this, we're adding code to the Buildbot masters that will query the jacuzzi server for allocated machines before starting a pending job. If the jacuzzi server returns a 404 (indicating that it's unaware of the builder), we'll get the full list of allocated machines from the /allocated/$poolname endpoint, subtract it from the full list of machines in the pool, and try to start the job on one of the remaining ones. If the allocator service is unavailable for a long period of time we'll just choose a machine from the full list.
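Here's a minimal sketch of that selection logic in Python, outside of any real Buildbot code. The allocator hostname, the function name and arguments, and the use of the requests library are all made up for illustration; the real change will live in the masters' machine-selection code and will be more careful about retries and caching:

    import random
    import requests  # assuming the requests library is available on the masters

    ALLOCATOR = "http://jacuzzi-allocator.example.com"  # hypothetical hostname

    def pick_machine(buildername, poolname, pool_machines):
        """Choose a machine for a pending job, respecting jacuzzi allocations.

        pool_machines is the full list of machines in the pool, which the
        Buildbot master already knows about."""
        try:
            resp = requests.get("%s/builders/%s" % (ALLOCATOR, buildername),
                                timeout=5)
            if resp.status_code == 404:
                # The allocator doesn't know about this builder, so avoid any
                # machines that are allocated to other builders in the pool.
                allocated = set(requests.get("%s/allocated/%s" % (ALLOCATOR, poolname),
                                             timeout=5).json())
                candidates = [m for m in pool_machines if m not in allocated]
            else:
                resp.raise_for_status()
                candidates = resp.json()
        except requests.RequestException:
            # Allocator unreachable: fall back to the full pool (the real code
            # would only do this after the allocator has been down for a while).
            candidates = pool_machines
        return random.choice(candidates or pool_machines)

Picking randomly among the remaining candidates is just a placeholder here; the masters already have their own logic for choosing among idle machines, and this override only needs to narrow down the list they choose from.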

This implementation has the nice side effect of allowing for a gradual roll out -- we can simply leave most builders undefined on the jacuzzi server until we're comfortable enough to roll it out more widely.

In order to get things going as quickly as possible I've implemented the jacuzzi allocator as static files for now, supporting only two builders on the Cedar tree. Catlee is working on writing the Buildbot code described above, and Rail is adjusting a few smaller tools to support this. John Hopkins, Taras, and Glandium were all involved in brainstorming and planning yesterday, too.

We're hoping to get something working on the Cedar branch in production ASAP, possibly as early as tomorrow. While we fiddle with different allocations there we can also work on implementing the real jacuzzi allocator.

Stay tuned for more updates!

This week in Mozilla RelEng – February 7th, 2014 - new format (again)!

I've heard from a few people that the long list of bugs I've been providing for the past few weeks is a little difficult to read. Starting this week I'll be supplementing it with some highlights at the top that talk about some of the more important things that were worked on throughout the week.

Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

Experiments with smaller pools of build machines

Since the 3.0 days we've been using a pool of identical machines to build Firefox. It started off with a few machines per platform, and has since expanded into many, many more (close to 100 on Mac, close to 200 on Windows, and many more hundreds on Linux). This machine pooling is one of the main things that has enabled us to scale to support so many more branches, pushes, and developers. It means that we don't need to close the trees when a single machine fails (anyone remember fx-win32-tbox?) and makes it easier to buy extra capacity like we've done with our use of Amazon's EC2.

However, this doesn't come without a price. On mozilla-inbound alone there are more than 30 different jobs that a Linux build machine can run. Multiply that by 20 or 30 branches and you get a Very Large Number. With so many different types of jobs to do, a machine rarely ends up doing the same job twice in a row. This means that a very high percentage of our build jobs are clobbers. Even with ccache enabled, these take much more time than an incremental build.

This week I ran a couple of experiments using a smaller pool of machines ("hot tubs") to handle a subset of job types on mozilla-inbound. The results have been somewhat staggering. A hot tub with 3 machines returned results in 60% of the time our production pool did, on average, but coalesced 5x as many pushes. A hot tub with 6 machines returned results in 65% of the time our production pool did, on average, and only coalesced 1.4x as many pushes. For those interested, the raw numbers are available.

With more tweaks to the number of machines and job types in a hot tub I think we can make these numbers even better - maybe even to the point where we both give results sooner and reduce coalescing. We also have some technical hurdles to overcome in order to implement this in production. Stay tuned for further updates on this effort!

This week in Mozilla RelEng – January 31st, 2014

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

This week in Mozilla RelEng – January 24th, 2014

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

The code that generates this report is available at https://github.com/bhearsum/this-week-in-bugs

This week in Mozilla RelEng – January 17th, 2014 -- new format!

This week I'm going to try something new and report on in-progress bugs in addition to completed ones. I finally got around to scripting this, too (at Catlee's urging), and thus the categories have changed to match the Release Engineering Bugzilla components exactly. The code that generates this is available in a github repository for the curious:

In progress work (unresolved and not assigned to nobody):

Completed work (resolution is 'FIXED'):

This week (and a half) in Mozilla RelEng – January 10th, 2014

This week in Mozilla RelEng - December 20th, 2013

Like last week, this is a very rough approximation of RelEng-related work that completed this week. Because this is the last week before vacation, we haven't landed much in the past couple of days, and some regularly scheduled release work was postponed.

This week in Mozilla RelEng - December 13th, 2013

I thought it might be interesting for folks to get an overview of the rate of change of RelEng infrastructure and perhaps a better idea of what RelEng work looks like. The following list was generated by looking at Release Engineering bugs that were marked as FIXED in the past week. Because it looks at only the "Release Engineering" product in Bugzilla it doesn't represent everything that everyone in RelEng did, nor was all the work below done by people on the Release Engineering team, but I think it's a good starting point!