Experiments with smaller pools of build machines

Since the Firefox 3.0 days we've been using a pool of identical machines to build Firefox. It started off with a few machines per platform, and has since expanded to many, many more (close to 100 on Mac, close to 200 on Windows, and many hundreds on Linux). This machine pooling is one of the main things that has enabled us to scale to support so many more branches, pushes, and developers. It means that we don't need to close the trees when a single machine fails (anyone remember fx-win32-tbox?), and it makes it easier to add extra capacity, as we've done with Amazon's EC2.

However, this doesn't come without a price. On mozilla-inbound alone there are more than 30 different jobs that a Linux build machine can run. Multiply that by 20 or 30 branches and you get a Very Large Number. With so many different types of jobs it could do, a machine rarely ends up doing the same job twice in a row. This means that a very high percentage of our build jobs are clobbers, and even with ccache enabled, a clobber takes much more time than an incremental build.
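To see why pooling drives the clobber rate up, here's a toy sketch (a made-up model, not our actual scheduler): if jobs are handed out roughly uniformly from J job types, a machine's next job matches its previous one only about 1/J of the time, so with dozens of job types nearly every build is a clobber.

```python
import random

def clobber_rate(num_job_types, num_jobs=100_000, seed=42):
    """Estimate how often a machine's next job differs from its previous
    one, forcing a clobber (full rebuild) rather than an incremental
    build. Toy model: jobs are drawn uniformly from num_job_types types."""
    rng = random.Random(seed)
    prev, clobbers = None, 0
    for _ in range(num_jobs):
        job = rng.randrange(num_job_types)
        if job != prev:
            clobbers += 1
        prev = job
    return clobbers / num_jobs

# 30 job types on one branch: roughly 1 - 1/30, i.e. ~97% clobbers.
print(clobber_rate(30))
# Hundreds of (job type, branch) combinations: effectively all clobbers.
print(clobber_rate(30 * 20))
```

A "hot tub" attacks exactly this term: shrink the set of job types a machine can see and the chance of a warm, incremental build goes way up.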

This week I've run a couple of experiments using smaller pools of machines ("hot tubs") to handle a subset of job types on mozilla-inbound. The results have been somewhat staggering. A hot tub with 3 machines returned results in an average of 60% of the time our production pool took, but coalesced 5x the number of pushes. A hot tub with 6 machines returned results in an average of 65% of the time our production pool took, and only coalesced 1.4x the number of pushes. For those interested, the raw numbers are available.
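The machines-vs-coalescing trade-off in those numbers can be sketched with a toy queue model (all parameters here — pushes every 10 minutes, 60-minute builds — are invented for illustration, and real buildbot scheduling is more involved): when a machine frees up, every push still waiting is merged into a single build, which is what "coalescing" means in these figures.

```python
import heapq

def simulate(num_machines, push_interval, build_time, num_pushes):
    """Toy queue model of push coalescing: pushes arrive at a fixed
    interval, and a machine picking up work builds *all* pending pushes
    as a single job (they are "coalesced"). Returns (average pushes per
    build, average minutes from push until its results land)."""
    free_at = [0.0] * num_machines       # when each machine next frees up
    heapq.heapify(free_at)
    arrivals = [i * push_interval for i in range(num_pushes)]
    builds, total_wait, i = 0, 0.0, 0
    while i < len(arrivals):
        soonest = free_at[0]             # next moment a machine is free
        pending = [arrivals[i]]          # always take at least one push...
        i += 1
        while i < len(arrivals) and arrivals[i] <= soonest:
            pending.append(arrivals[i])  # ...plus everything queued behind it
            i += 1
        start = max(soonest, pending[0])
        heapq.heapreplace(free_at, start + build_time)
        total_wait += sum(start + build_time - t for t in pending)
        builds += 1
    return num_pushes / builds, total_wait / num_pushes

# Pushes every 10 minutes, 60-minute builds, 10 pushes:
# 1 machine coalesces ~3.3 pushes/build; 6 machines barely coalesce at all.
for machines in (1, 3, 6):
    coal, wait = simulate(machines, push_interval=10,
                          build_time=60, num_pushes=10)
    print(f"{machines} machines: {coal:.1f} pushes/build, "
          f"avg wait {wait:.0f} min")
```

Even this crude model shows the shape of the real data: shrinking the pool batches more pushes into each build, and the win comes when faster (incremental) builds offset the extra queueing.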

With more tweaks to the number of machines and job types in a hot tub I think we can make these numbers even better - maybe even to the point where we both give results sooner and reduce coalescing. We also have some technical hurdles to overcome in order to implement this in production. Stay tuned for further updates on this effort!

