Status update on smaller pools of build machines

Last week glandium and I wrote a bit about how shrinking the size of our build pools would help get results out faster. This week, a few of us are starting work on implementing that. Last week I called these smaller pools "hot tubs", but we've since settled on the name "jacuzzis".

We had a few discussions about this in the past few days and quickly realized that we don't have a way of knowing upfront exactly how many machines to allocate to each jacuzzi. This number will vary based on the frequency of the builders in the jacuzzi (periodic, nightly, or on-change) as well as the number of pushes to the builders' branches. Because of this, we are firmly committed to making these allocations dynamically adjustable. It may take us many attempts to get the allocations right, so we need to make adjusting them as painless as possible. This will also enable us to shift allocations if there's a sudden change in load (e.g., mozilla-inbound is closed for a day, but b2g-inbound gets twice the number of pushes as usual).

There are two major pieces that need to be worked on: writing a jacuzzi allocator service, and making Buildbot respect it.

The allocator service will be a simple JSON API with the following interface. Below, $buildername is a specific type of job (e.g. "Linux x86-64 mozilla-inbound build"), $machinename is something like "bld-linux64-ec2-023" and $poolname is something like "bld-linux64":

  • GET /builders/$buildername - Returns a list of machines that are allowed to perform this type of build.
  • GET /machines/$machinename - Returns a list of builders that this machine is allowed to build for.
  • GET /allocated/$poolname - Returns a list of all machines in the given pool that are allocated to any builder.
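As a rough illustration of the interface above, here is a minimal in-memory sketch of how the three endpoints could resolve. The data, the handle_get helper, the response shapes ({"machines": [...]} and {"builders": [...]}), and the 404 for unknown machines or pools are all assumptions made for the example; only the 404 for an unknown builder is part of the plan described in this post.

```python
from urllib.parse import quote, unquote

# Hypothetical allocation data: builder name -> machines in its jacuzzi.
ALLOCATIONS = {
    "Linux x86-64 mozilla-inbound build": [
        "bld-linux64-ec2-023",
        "bld-linux64-ec2-024",
    ],
}

# Hypothetical pool membership: pool name -> every machine in the pool.
POOLS = {
    "bld-linux64": [
        "bld-linux64-ec2-023",
        "bld-linux64-ec2-024",
        "bld-linux64-ec2-025",
    ],
}


def handle_get(path):
    """Resolve a GET path to (status, body). The body shapes are assumed."""
    parts = path.strip("/").split("/", 1)
    if len(parts) != 2:
        return 404, None
    kind, name = parts[0], unquote(parts[1])

    if kind == "builders":
        # Unknown builder -> 404, so callers know to fall back.
        if name not in ALLOCATIONS:
            return 404, None
        return 200, {"machines": ALLOCATIONS[name]}

    if kind == "machines":
        builders = sorted(b for b, ms in ALLOCATIONS.items() if name in ms)
        if not builders:
            return 404, None
        return 200, {"builders": builders}

    if kind == "allocated":
        if name not in POOLS:
            return 404, None
        allocated = {m for ms in ALLOCATIONS.values() for m in ms}
        return 200, {"machines": [m for m in POOLS[name] if m in allocated]}

    return 404, None
```

Builder names contain spaces, so a client would need to URL-encode them, for example handle_get("/builders/" + quote("Linux x86-64 mozilla-inbound build")).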

Making Buildbot respect the allocator is going to be a little tricky. Buildbot requires an explicit list of machines that can perform jobs from each builder. If we implemented jacuzzis by editing this list, we wouldn't be able to adjust allocations dynamically. However, we can adjust some runtime code to override that list after talking to the allocator. We also need to make sure that we can fall back in cases where the allocator is unaware of a builder or unavailable.

To do this, we're adding code to the Buildbot masters that will query the jacuzzi server for allocated machines before they start a pending job. If the jacuzzi server returns a 404 (indicating that it's unaware of the builder), we'll get the full list of allocated machines from the /allocated/$poolname endpoint. We can subtract this from the full list of machines in the pool and try to start the job on one of the remaining ones. If the allocator service is unavailable for a long period of time, we'll just choose a machine from the full list.
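The selection logic described above could be sketched roughly as follows. This is not the actual Buildbot code: candidate_machines, the fetch callable, the AllocatorUnavailable exception, and the {"machines": [...]} response shape are all hypothetical names chosen for illustration. For brevity the sketch falls back to the full pool as soon as the allocator fails, whereas the plan is to do so only after it has been unavailable for a long period.

```python
from urllib.parse import quote


class AllocatorUnavailable(Exception):
    """Hypothetical error raised by fetch() when the allocator can't be reached."""


def candidate_machines(buildername, poolname, pool_machines, fetch):
    """Return the machines a pending job for buildername may start on.

    fetch(path) is an assumed helper returning (status, body) for a GET
    against the allocator, or raising AllocatorUnavailable.
    """
    try:
        status, body = fetch("/builders/" + quote(buildername, safe=""))
        if status == 200:
            # The builder has a jacuzzi: use only its allocated machines.
            return list(body["machines"])
        if status == 404:
            # Unknown builder: use pool machines not reserved by any jacuzzi.
            status, body = fetch("/allocated/" + quote(poolname, safe=""))
            if status == 200:
                allocated = set(body["machines"])
                return [m for m in pool_machines if m not in allocated]
    except AllocatorUnavailable:
        pass
    # Allocator unavailable or unexpected response: fall back to the full pool.
    return list(pool_machines)
```

Passing fetch in as a parameter keeps the decision logic separate from the HTTP client, which makes the fallback paths easy to exercise with a fake allocator.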

This implementation has the nice side effect of allowing for a gradual roll out -- we can simply leave most builders undefined on the jacuzzi server until we're comfortable enough to roll it out more widely.

In order to get things going as quickly as possible I've implemented the jacuzzi allocator as static files for now, supporting only two builders on the Cedar tree. Catlee is working on writing the Buildbot code described above, and Rail is adjusting a few smaller tools to support this. John Hopkins, Taras, and Glandium were all involved in brainstorming and planning yesterday, too.

We're hoping to get something working on the Cedar branch in production ASAP, possibly as early as tomorrow. While we fiddle with different allocations there we can also work on implementing the real jacuzzi allocator.

Stay tuned for more updates!
