This week in Mozilla RelEng – February 28th, 2014

Highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

Lifecycle of a Release Engineering bug

Like most other engineering groups at Mozilla, Release Engineering uses Bugzilla to track most of the work we do. Our bugs and patches use many of the same flags as a Firefox bug, so it’s no surprise that many people expect our bugs to be like Firefox bugs in other ways. I’ve noticed a few instances of confusion recently, so I’d like to try to clear things up a little bit by talking about what’s different about ours.

In order to understand our bugs, a brief overview of our repositories is required. Unlike Firefox, RelEng hosts its code in a number of different Mercurial repositories. Most of these repositories have (or should have) a “production” branch in addition to the “default” one. Our production systems (Buildbot masters/slaves, AWS management machines, signing servers, etc.) track the production branches of our repositories. This is important for a few reasons:

  • Some patches can depend on other patches in a different repository
  • Some patches need to be deployed at specific times
  • Some systems cannot automatically use new code
  • Every commit has the potential to close the trees

Once a patch has been reviewed, it will be landed on the default branch of its repository. At this time, the checked-in attachment flag is set to “+”. Unlike Firefox bugs, the bug stays open at this point. Many of our repositories have their own continuous integration tests that watch the default branches and do as much up-front verification as they can. Like mozilla-central, we do our best to stay in a shippable state at all times – so if we have test failures, they get fixed quickly or backed out. Sometime later (usually about once per day), someone will merge all of the pending patches to the production branches and make sure the production systems pick them up. Once this is done, they will update the Maintenance page and add a note like “In production” to all of the bugs that had patches merged. Any new jobs that start after this point will be using all of the new code. In most cases, bugs will be left open and closed by their assignee when appropriate.

I hope this helps the next time you’re confused about the state of some RelEng work!

This week in Mozilla RelEng – February 21st, 2014

Highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

This week in Mozilla RelEng – February 14th, 2014

Highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

Status update on smaller pools of build machines

Last week glandium and I wrote a bit about how shrinking the size of our build pools would help get results out faster. This week, a few of us are starting work on implementing that. Last week I called these smaller pools “hot tubs”, but we’ve since settled on the name “jacuzzis”.

We had a few discussions about this in the past few days and quickly realized that we don’t have a way of knowing upfront exactly how many machines to allocate to each jacuzzi. This number will vary based on the frequency of the builders in the jacuzzi (periodic, nightly, or on-change) as well as the number of pushes to the builders’ branches. Because of this, we are firmly committed to making these allocations dynamically adjustable. It may take us many attempts to get the allocations right – so we need to make adjusting them as painless as possible. This will also enable us to shift allocations if there’s a sudden change in load (e.g., mozilla-inbound is closed for a day, but b2g-inbound gets twice the number of pushes as usual).

There are two major pieces that need to be worked on: writing a jacuzzi allocator service and making Buildbot respect it.

The allocator service will be a simple JSON API with the following interface. Below, $buildername is a specific type of job (e.g., “Linux x86-64 mozilla-inbound build”), $machinename is something like “bld-linux64-ec2-023”, and $poolname is something like “bld-linux64”:

  • GET /builders/$buildername – Returns a list of machines that are allowed to perform this type of build.
  • GET /machines/$machinename – Returns a list of builders that this machine is allowed to build for.
  • GET /allocated/$poolname – Returns a list of all machines in the given pool that are allocated to any builder.
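As a rough sketch of how a consumer might talk to that interface, here is a hypothetical Python client. The base URL, the response shape ({"machines": [...]}), and the function names are all assumptions made for illustration, not details of the actual service:

```python
import requests

# Hypothetical base URL for the jacuzzi allocator; the real host isn't
# named in this post.
JACUZZI_BASE = "http://jacuzzi-allocator.example.com"

def allowed_machines(buildername):
    """Machines allowed to perform this type of build, or None if the
    allocator has no entry for the builder (HTTP 404)."""
    resp = requests.get("%s/builders/%s" % (JACUZZI_BASE, buildername))
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    # Assumed response shape: {"machines": ["bld-linux64-ec2-023", ...]}
    return resp.json()["machines"]

def allocated_machines(poolname):
    """All machines in the given pool that are allocated to any builder."""
    resp = requests.get("%s/allocated/%s" % (JACUZZI_BASE, poolname))
    resp.raise_for_status()
    return resp.json()["machines"]
```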

Making Buildbot respect the allocator is going to be a little tricky. Buildbot requires an explicit list of machines that can perform jobs for each builder. If we implement jacuzzis by adjusting that list directly, we won’t be able to change allocations dynamically. However, we can adjust some runtime code to override the list after talking to the allocator. We also need to make sure that we can fall back gracefully in cases where the allocator is unaware of a builder or is unavailable.

To do this, we’re adding code to the Buildbot masters that will query the jacuzzi server for allocated machines before starting a pending job. If the jacuzzi server returns a 404 (indicating that it’s unaware of the builder), we’ll get the full list of allocated machines from the /allocated/$poolname endpoint. We can subtract this from the full list of machines in the pool and try to start the job on one of the remaining ones. If the allocator service is unavailable for a long period of time, we’ll just choose a machine from the full list.
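Here is a minimal sketch of that fallback logic, reusing the hypothetical JACUZZI_BASE and allocated_machines helpers from the sketch above; the real Buildbot patch will look different:

```python
import random
import requests

def pick_machine(buildername, poolname, pool_machines):
    """Pick a machine for a pending job, respecting the jacuzzi allocator
    when possible (illustrative only, not the real Buildbot code)."""
    try:
        resp = requests.get("%s/builders/%s" % (JACUZZI_BASE, buildername),
                            timeout=10)
        if resp.status_code == 404:
            # The allocator doesn't know this builder: avoid machines that
            # are allocated to other builders and use the rest of the pool.
            taken = set(allocated_machines(poolname))
            candidates = [m for m in pool_machines if m not in taken]
        else:
            resp.raise_for_status()
            candidates = resp.json()["machines"]
    except requests.RequestException:
        # Allocator unavailable: fall back to the full pool so jobs keep running.
        candidates = list(pool_machines)
    return random.choice(candidates)
```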

This implementation has the nice side effect of allowing for a gradual rollout – we can simply leave most builders undefined on the jacuzzi server until we’re comfortable enough to roll it out more widely.

In order to get things going as quickly as possible, I’ve implemented the jacuzzi allocator as static files for now, supporting only two builders on the Cedar tree. Catlee is working on writing the Buildbot code described above, and Rail is adjusting a few smaller tools to support this. John Hopkins, Taras, and Glandium were all involved in brainstorming and planning yesterday, too.
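For illustration, a static-file allocator can be as simple as JSON documents whose paths mirror the URL scheme above, served by any plain web server. The builder names, machine names, and directory layout below are made up:

```python
import json
import os

# Made-up allocations for two hypothetical Cedar builders.
ALLOCATIONS = {
    "Linux x86-64 cedar build": ["bld-linux64-ec2-001", "bld-linux64-ec2-002"],
    "Linux x86-64 cedar leak test build": ["bld-linux64-ec2-003"],
}

def write_static_allocator(root, poolname="bld-linux64"):
    """Write JSON files laid out to match /builders/... and /allocated/...,
    so a static web server can answer the allocator's URLs."""
    for sub in ("builders", "allocated"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    for builder, machines in ALLOCATIONS.items():
        with open(os.path.join(root, "builders", builder), "w") as f:
            json.dump({"machines": machines}, f)
    everything = sorted({m for ms in ALLOCATIONS.values() for m in ms})
    with open(os.path.join(root, "allocated", poolname), "w") as f:
        json.dump({"machines": everything}, f)
```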

We’re hoping to get something working on the Cedar branch in production ASAP, possibly as early as tomorrow. While we fiddle with different allocations there we can also work on implementing the real jacuzzi allocator.

Stay tuned for more updates!

This week in Mozilla RelEng – February 7th, 2014 – new format (again)!

I’ve heard from a few people that the long list of bugs I’ve been providing for the past few weeks is a little difficult to read. Starting this week, I’ll be supplementing that with some highlights at the top covering some of the more important things that were worked on throughout the week.

Highlights:

Completed work (resolution is ‘FIXED’):

In progress work (unresolved and not assigned to nobody):

Experiments with smaller pools of build machines

Since the Firefox 3.0 days, we’ve been using a pool of identical machines to build Firefox. It started off with a few machines per platform, and has since expanded into many, many more (close to 100 on Mac, close to 200 on Windows, and many hundreds more on Linux). This machine pooling is one of the main things that has enabled us to scale to support so many more branches, pushes, and developers. It means that we don’t need to close the trees when a single machine fails (anyone remember fx-win32-tbox?) and makes it easier to buy extra capacity, like we’ve done with our use of Amazon’s EC2.

However, this doesn’t come without a price. On mozilla-inbound alone there are more than 30 different jobs that a Linux build machine can run. Multiply that by 20 or 30 branches and you get a Very Large Number. With so many different types of jobs to choose from, a machine rarely ends up doing the same job twice in a row. This means that a very high percentage of our build jobs are clobbers. Even with ccache enabled, these take much more time than an incremental build.

This week I’ve run a couple of experiments using a smaller pool of machines (“hot tubs”) to handle a subset of job types on mozilla-inbound. The results have been somewhat staggering. A hot tub with 3 machines returned results in an average of 60% of the time our production pool did, but coalesced 5x the number of pushes. A hot tub with 6 machines returned results in an average of 65% of the time our production pool did, and coalesced only 1.4x the number of pushes. For those interested, the raw numbers are available.

With more tweaks to the number of machines and job types in a hot tub, I think we can make these numbers even better – maybe even to the point where we both give results sooner and reduce coalescing. We also have some technical hurdles to overcome in order to implement this in production. Stay tuned for further updates on this effort!