This week in Mozilla RelEng – February 14th, 2014

Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

Status update on smaller pools of build machines

Last week glandium and I wrote a bit about how shrinking the size of our build pools would help get results out faster. This week, a few of us are starting work on implementing that. Last week I called these smaller pools "hot tubs", but we've since settled on the name "jacuzzis".

We had a few discussions about this in the past few days and quickly realized that we don't have a way of knowing upfront exactly how many machines to allocate to each jacuzzi. This number will vary based on the frequency of the builders in the jacuzzi (periodic, nightly, or on-change) as well as the number of pushes to the builders' branches. Because of this we are firmly committed to making these allocations dynamically adjustable. It may take us many attempts to get the allocations right, so we need to make adjusting them as painless as possible. This will also enable us to shift allocations if there's a sudden change in load (eg, mozilla-inbound is closed for a day, but b2g-inbound gets twice as many pushes as usual).

There are two major pieces that need to be worked on: writing a jacuzzi allocator service, and making Buildbot respect it.

The allocator service will be a simple JSON API with the following interface. Below, $buildername is a specific type of job (eg "Linux x86-64 mozilla-inbound build"), $machinename is something like "bld-linux64-ec2-023", and $poolname is something like "bld-linux64":

  • GET /builders/$buildername - Returns a list of machines that are allowed to perform this type of build.
  • GET /machines/$machinename - Returns a list of builders that this machine is allowed to build for.
  • GET /allocated/$poolname - Returns a list of all machines in the given pool that are allocated to any builder.
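To make the interface a little more concrete, here's roughly what requests and responses might look like. Each endpoint just returns a list, but the exact response format, and the specific machine and builder names below, are illustrative rather than final:

    GET /builders/Linux x86-64 mozilla-inbound build
    ["bld-linux64-ec2-023", "bld-linux64-ec2-045", "bld-linux64-ec2-101"]

    GET /machines/bld-linux64-ec2-023
    ["Linux x86-64 mozilla-inbound build"]

    GET /allocated/bld-linux64
    ["bld-linux64-ec2-023", "bld-linux64-ec2-045", "bld-linux64-ec2-101"]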

Making Buildbot respect the allocator is going to be a little tricky. Buildbot requires an explicit list of machines that can perform jobs for each builder. If we implement jacuzzis by adjusting that list, we won't be able to change allocations dynamically. However, we can adjust some runtime code to override that list after talking to the allocator. We also need to make sure that we can fall back in cases where the allocator is unaware of a builder, or is unavailable entirely.

To do this, we're adding code to the Buildbot masters that will query the jacuzzi server for allocated machines before starting a pending job. If the jacuzzi server returns a 404 (indicating that it's unaware of the builder), we'll get the full list of allocated machines from the /allocated/$poolname endpoint, subtract it from the full list of machines in the pool, and try to start the job on one of the remaining ones. If the allocator service is unavailable for a long period of time we'll just choose a machine from the full list.
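Here's a minimal sketch of that selection logic in Python, outside of any real Buildbot code. The allocator hostname, the function name and arguments, and the use of the requests library are all made up for illustration; the real change will live in the masters' machine-selection code and will be more careful about retries and caching:

    import random
    import requests  # assuming the requests library is available on the masters

    ALLOCATOR = "http://jacuzzi-allocator.example.com"  # hypothetical hostname

    def pick_machine(buildername, poolname, pool_machines):
        """Choose a machine for a pending job, respecting jacuzzi allocations.

        pool_machines is the full list of machines in the pool, which the
        Buildbot master already knows about."""
        try:
            resp = requests.get("%s/builders/%s" % (ALLOCATOR, buildername),
                                timeout=5)
            if resp.status_code == 404:
                # The allocator doesn't know about this builder, so avoid any
                # machines that are allocated to other builders in the pool.
                allocated = set(requests.get("%s/allocated/%s" % (ALLOCATOR, poolname),
                                             timeout=5).json())
                candidates = [m for m in pool_machines if m not in allocated]
            else:
                resp.raise_for_status()
                candidates = resp.json()
        except requests.RequestException:
            # Allocator unreachable: fall back to the full pool (the real code
            # would only do this after the allocator has been down for a while).
            candidates = pool_machines
        return random.choice(candidates or pool_machines)

Picking randomly among the remaining candidates is just a placeholder here; the masters already have their own logic for choosing among idle machines, and this override only needs to narrow down the list they choose from.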

This implementation has the nice side effect of allowing for a gradual roll out -- we can simply leave most builders undefined on the jacuzzi server until we're comfortable enough to roll it out more widely.

In order to get things going as quickly as possible I've implemented the jacuzzi allocator as static files for now, supporting only two builders on the Cedar tree. Catlee is working on writing the Buildbot code described above, and Rail is adjusting a few smaller tools to support this. John Hopkins, Taras, and Glandium were all involved in brainstorming and planning yesterday, too.

We're hoping to get something working on the Cedar branch in production ASAP, possibly as early as tomorrow. While we fiddle with different allocations there we can also work on implementing the real jacuzzi allocator.

Stay tuned for more updates!

This week in Mozilla RelEng – February 7th, 2014 - new format (again)!

I've heard from a few people that the long list of bugs I've been providing for the past few weeks is a little difficult to read. Starting this week I'll be supplementing it with some highlights at the top that talk about some of the more important things that were worked on throughout the week.

Highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

Experiments with smaller pools of build machines

Since the 3.0 days we've been using a pool of identical machines to build Firefox. It started off with a few machines per platform, and has since expanded into many, many more (close to 100 on Mac, close to 200 on Windows, and many more hundreds on Linux). This machine pooling is one of the main things that has enabled us to scale to support so many more branches, pushes, and developers. It means that we don't need to close the trees when a single machine fails (anyone remember fx-win32-tbox?) and makes it easier to buy extra capacity like we've done with our use of Amazon's EC2.

However, this doesn't come without a price. On mozilla-inbound alone there are more than 30 different jobs that a Linux build machine can run. Multiply that by 20 or 30 branches and you get a Very Large Number. With so many different types of jobs to do, a machine rarely ends up doing the same job twice in a row. This means that a very high percentage of our build jobs are clobbers. Even with ccache enabled, these take much more time than an incremental build.

This week I ran a couple of experiments using a smaller pool of machines ("hot tubs") to handle a subset of job types on mozilla-inbound. The results have been somewhat staggering. A hot tub with 3 machines returned results in 60% of the time our production pool did, on average, but coalesced 5x as many pushes. A hot tub with 6 machines returned results in 65% of the time our production pool did, on average, and only coalesced 1.4x as many pushes. For those interested, the raw numbers are available.

With more tweaks to the number of machines and job types in a hot tub I think we can make these numbers even better - maybe even to the point where we both give results sooner and reduce coalescing. We also have some technical hurdles to overcome in order to implement this in production. Stay tuned for further updates on this effort!

This week in Mozilla RelEng – January 31st, 2014

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

This week in Mozilla RelEng – January 24th, 2014

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

The code that generates this report is available at https://github.com/bhearsum/this-week-in-bugs

This week in Mozilla RelEng – January 17th, 2014 -- new format!

This week I'm going to try something new and report on in-progress bugs in addition to completed ones. I finally got around to scripting this, too (at Catlee's urging), and thus the categories have changed to match the Release Engineering Bugzilla components exactly. The code that generates this is available in a github repository for the curious:

In progress work (unresolved and not assigned to nobody):

Completed work (resolution is 'FIXED'):

This week (and a half) in Mozilla RelEng – January 10th, 2014

This week in Mozilla RelEng - December 20th, 2013

Like last week, this is a very rough approximation of RelEng-related work that completed this week. Because this is the last week before vacation, we haven't landed much in the past couple of days, and some regularly scheduled release work was postponed.

This week in Mozilla RelEng - December 13th, 2013

I thought it might be interesting for folks to get an overview of the rate of change of RelEng infrastructure and perhaps a better idea of what RelEng work looks like. The following list was generated by looking at Release Engineering bugs that were marked as FIXED in the past week. Because it looks at only the "Release Engineering" product in Bugzilla it doesn't represent everything that everyone in RelEng did, nor was all the work below done by people on the Release Engineering team, but I think it's a good starting point!