release-automation - Part 3: Improvements & Optimizations (2009 to early 2011)
In my last post I talked about the major project of switching the release-automation from Bootstrap driven by Buildbot to being directly implemented in Buildbot, and working out of Mercurial. After a 6 month break from automation work, there were a few spurts of development on the automation over the course of the next two years. Some of these were big new things, like Fennec automation, while others were deliberate attempts to improve the automation. This post will cover the most important changes that happened from late 2009 all the way through early 2011.
Late 2009 to mid 2010
Fennec release-automation
In 2009 Mozilla began working on a version of Firefox for Maemo. Late in that year, we shipped 1.0rc1 with the release-automation. Some people may be thinking "that doesn't sound very hard, it's just another platform, right?". Unfortunately, there's a lot of hidden complexity in adding a new platform that doesn't conform to long-held assumptions, like mobile. While the actual build process is fairly similar, there's a lot of pre-build, post-build, and other work that just isn't the same. Fennec was the first product we supported that was built out of multiple source repositories, which not only caused problems for builds (and isn't handled well by Buildbot), but affected how we tag repositories and generate source tarballs. L10n repacks were also completely different for Fennec: not only did we ship individual l10n builds for many locales, but we also shipped builds with multiple locales in them. Doing this meant build process changes as well as a new format to describe locales, revisions, and which types of repacks each one needed. All of this combined ended up being nearly a month of work (and many late nights, Aki tells me) to get up and running! This was the first product we've ever shipped that had automated releases from the start, which is a huge accomplishment for forward thinking & planning - something that we simply didn't have time for in the past. It's hard to determine how many hours of end2end time and how many manual touchpoints this saved, since it was never manual work to begin with, but there's no doubt that we're far better off with it than without.
Major Update
In the latter half of 2009 we started doing a lot of Major Updates. That is, offering 3.0.x users an update to a 3.5.x release. Behind the scenes, each Major Update offer had an end2end time of approximately 4 hours and at least 6 or 7 manual touchpoints in order to do config file bumping, snippet generation, test snippet pushing, and verification of those snippets. If that wasn't bad enough, a single mistake in the configuration file would cause us to have to restart the entire process! Automating this turned out to be one of the easier pieces of new automation because of how similar Major Updates were to the regular updates we already did with every release. When this relatively simple work was done, all of the manual touchpoints were gone completely, and because these were now done automatically with a release instead of out of band, they moved out of the critical path and therefore had no end2end time impact either! This is always the best kind of new automation =).
Bouncer Entries
In mid-2010 we automated a long standing annoyance: Bouncer entry creation. Like the Major Updates, this was something that was subject to manual error. More importantly, it was _damn_ annoying to do. Bouncer is the piece of software that powers download.mozilla.org, which redirects download requests to our mirror network. Each time we release we need to tell it where to find the files we ship. This translates to one entry for each installer, complete MAR, and partial MAR for each platform. Prior to this being fixed, this was done mostly through copy and paste, which leaves a massive amount of room for error. In the best case scenario this means we'll get some 404s, which are easy to detect and fix. In the worst case we could point at the wrong release entirely, which is an error that may not get caught at all. Fixing this didn't improve our end2end time at all, but it did take away the most annoying manual touchpoint, which we were all very happy about. After this change the automation stayed relatively stable for the next 6 months, with only minor bugfixes happening.
Late 2010 to early 2011
At the end of 2010 and start of 2011 we began a huge round of upgrades and optimizations, starting with upgrading to a new version of Buildbot. This work wasn't shinyfun, but it was long overdue after the regular Continuous Integration infrastructure had upgraded many months prior. After that was done some of us spent the next couple of months working hard on some new automation & improvements. This was one of the most exciting and active times for the release-automation. We lowered end2end time by parallelizing some things, we took away many manual touchpoints with new pieces of automation, and we dramatically improved stability through intelligent retrying of failed operations. Also of note is that we went back to a model of having standalone scripts doing work and having Buildbot drive those, not unlike the Buildbot+Bootstrap era. This came about after having a lot of challenges implementing some things directly in Buildbot code, which makes it very difficult to make decisions at runtime, and the feeling that we didn't want to tie ourselves to Buildbot forever.
Source Code Tagging
At the time, source repository tagging was one of the rougher parts of the automation. Not only did it often fail to push tags back to a repository due to losing a push race, but load issues caused us to get server side errors. For a period of time it was rare that a release *didn't* have a problem in tagging. Moving the tagging code to an external script made fixing these errors a lot easier. At the same time, we were able to start building up some very useful libraries for working with Mercurial, retrying failed commands, and other things. Since these changes have landed it's been very rare to have issues with tagging, and most of them have been regressions from recently landed things rather than long standing bugs with the tagging scripts.
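The "retrying failed commands" idea can be sketched roughly as below. This is a minimal illustration and not the actual release-automation library: the function name `retry`, its parameters, and the `cleanup` hook are all assumptions made for the example. The hook stands in for whatever recovery a lost push race needs before the next attempt (for instance, pulling and re-applying the tag).

```python
import time


def retry(action, attempts=5, sleeptime=0, retry_exceptions=(Exception,),
          cleanup=None):
    """Call action() until it succeeds or `attempts` tries have been made.

    Hypothetical helper for illustration only. `cleanup`, if given, runs
    after each failed try (but not after the last one) so the caller can
    reset state before trying again.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_exceptions:
            if attempt == attempts:
                # Out of tries: let the caller see the real error.
                raise
            if cleanup is not None:
                cleanup()
            time.sleep(sleeptime)


if __name__ == "__main__":
    # Simulated push that loses the race twice before succeeding.
    state = {"tries": 0}

    def flaky_push():
        state["tries"] += 1
        if state["tries"] < 3:
            raise IOError("simulated server-side error")
        return "pushed"

    print(retry(flaky_push, attempts=5, sleeptime=0))
```

Keeping the retry policy in one helper means every script that talks to a repository gets the same behaviour without repeating the loop.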
L10n Repacks
We used to have similar issues with our l10n repacking logic, too. Sometimes the jobs would die while trying to clone a repository or when trying to download or upload a build. Additionally, we used to use a different Buildbot job for each locale, which meant that we would redo steps like "clone/pull from source repository" for every single locale which was quite inefficient. As you may have guessed, we did a similar thing to fix these issues: moved them to a script! Because of the earlier work done with tagging we were able to get retrying of repository cloning for free, and easily add retrying of uploads/downloads. This script also introduced another new technique to the release-automation: chunking (which was shamelessly ripped off of the Mochitest harness). Rather than have 1 Buildbot job for every single locale, the script knows how to compute the overall set of work for all locales and pick a chunk of it to work on.
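Chunking as described above could look something like the following sketch. The function name `get_chunk`, its 1-based chunk numbering, and the locale list are assumptions for illustration; the real script and the Mochitest harness may split work differently.

```python
def get_chunk(items, total_chunks, this_chunk):
    """Return the slice of `items` belonging to chunk `this_chunk`.

    Hypothetical helper: chunks are numbered from 1, and any remainder is
    spread over the first chunks so sizes differ by at most one.
    """
    if not 1 <= this_chunk <= total_chunks:
        raise ValueError("this_chunk must be between 1 and total_chunks")
    base, extra = divmod(len(items), total_chunks)
    start = (this_chunk - 1) * base + min(this_chunk - 1, extra)
    end = start + base + (1 if this_chunk <= extra else 0)
    return items[start:end]


if __name__ == "__main__":
    # Example locale list, made up for the demonstration.
    locales = ["de", "es-ES", "fr", "it", "ja", "pl", "ru"]
    for n in range(1, 4):
        print(n, get_chunk(locales, total_chunks=3, this_chunk=n))
```

Each job only needs two numbers (the total number of chunks and its own index) to work out its share, so the shared setup such as cloning the source repository happens once per chunk rather than once per locale.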
Automated E-mail
Every release requires a lot of coordination, particularly with Release Drivers and QA. We need to send mail notifications when the release-automation starts, when each platform's en-US build is complete, when each platform's l10n repacks are complete, when updates are ready for testing, and some other events, too. It used to be that the Release Engineer responsible for the release would actively watch the jobs on a Buildbot display and send mail by hand as the jobs completed. Especially as we started doing releases more often, this became extremely tedious and distracting. It also caused artificial delays of up to 8 hours (in the worst case)! By automating these mails we massively reduced manual touchpoints, became more consistent with the messages we sent, allowed Release Engineers to more easily do other work mid-release, and in some extreme cases reduced end2end time of a release by multiple hours. Looking back on it this was one of the most important changes we've ever made, and certainly had the best cost/benefit ratio.
Pushing to Mirrors et al.
When we push a Firefox release out to the mirror network we get past the point of no return. Once it's out there, we have no way to pull it back and no way to guarantee that we overwrite all of the files on all of the mirrors in a timely manner. If we find bugs past that point we have to increment the version number and start again. Because of that we do a full antivirus check and verification of all permissions prior to pushing (in addition to all of the testing that QA already does). These used to be done all by hand - a Release Engineer would log onto a machine at some point between builds being available and prior to pushing, run some commands, and wait. Besides the annoyance of doing it by hand, we would sometimes forget to do this in advance of the release. When that happened these things all of a sudden were in the critical path, and holding up the release. To address both of those issues these checks were automated and done immediately after all release files became available. At the same time we partly automated the mirror push itself. Pushing to mirrors involves running a command like:
rsync -av --exclude=*tests* --exclude=*crashreporter* --exclude=*.log --exclude=*.txt --exclude=*unsigned* --exclude=*update-backup* --exclude=*partner-repacks* --exclude=*.checksums --exclude=logs --exclude=jsshell* --exclude=*/*.asc /pub/mozilla.org/firefox/nightly/10.0.2-candidates/build1/ /pub/mozilla.org/firefox/releases/10.0.2/
With such a non-trivial thing being required every time it's easy to make a mistake, so once again, automating is a clear way to reduce manual error.
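As a rough illustration of how the rsync command shown earlier might be generated rather than typed, here is a small sketch. The function name `mirror_push_command` and its parameters are hypothetical, and the path layout is inferred only from the single example command; the actual automation may build it differently. The sketch only constructs the argument list and does not run anything.

```python
# Exclude patterns copied from the example command in this post.
DEFAULT_EXCLUDES = [
    "*tests*", "*crashreporter*", "*.log", "*.txt", "*unsigned*",
    "*update-backup*", "*partner-repacks*", "*.checksums", "logs",
    "jsshell*", "*/*.asc",
]


def mirror_push_command(product, version, build_number,
                        excludes=DEFAULT_EXCLUDES):
    """Build the rsync argument list for pushing a build to the releases dir.

    Hypothetical helper: the candidates/releases paths follow the pattern
    seen in the example command and are an assumption beyond that.
    """
    src = "/pub/mozilla.org/%s/nightly/%s-candidates/build%d/" % (
        product, version, build_number)
    dst = "/pub/mozilla.org/%s/releases/%s/" % (product, version)
    cmd = ["rsync", "-av"]
    cmd.extend("--exclude=%s" % pattern for pattern in excludes)
    cmd.extend([src, dst])
    return cmd


if __name__ == "__main__":
    print(" ".join(mirror_push_command("firefox", "10.0.2", 1)))
```

Returning a list instead of a single string means the command can be handed to `subprocess` without a shell, so the exclude patterns reach rsync untouched instead of being expanded by shell globbing.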