Tag Archives: Release Automation

More on “How far we’ve come”

After I posted “How far we’ve come” this morning a few people expressed interest in what our release process looked like before, and what it looks like now.

The earliest recorded release process I know of was called the “Unified Release Process”. (I presume “unified” comes from unifying the ways different release engineers did things.) As you can see, it’s a very lengthy document, with lots of shell commands to tweak/copy/paste. A lot of the things that get run are actually scripts that wrap some parts of the process – so it’s not as bad as it could’ve been.

I was around for many of the improvements to this process. A while back I wrote a series of blog posts detailing some of them. For those interested, you can find them here:

I haven’t gotten around to writing a new one for the most recent version of the release automation, but if you compare our current Checklist to the old Unified Release Process, I’m sure you can get a sense of how much more efficient it is. Basically, we have push-button releases now. Fill in some basic info, push a button, and a release pops out:

How far we’ve come

When I joined Mozilla’s Release Engineering team (Build & Release at the time) back in 2007, the mechanics of shipping a release were a daunting task with zero automation. My earliest memories of doing releases are ones where I get up early, stay late, and spend my entire day on the release. I logged onto at least 8 different machines to run countless commands, sometimes forgetting to start “screen” and losing work due to a dropped network connection.

Last night I had a chat with Nick. When we ended the call I realized that the Firefox 30.0 release builds had started mid-call – completely without us. When I checked my e-mail this morning I found that the rest of the release build process had completed without issue or human intervention.

It’s easy to get bogged down thinking about current problems. Times like this make me realize that sometimes you just need to sit down and recognize how far you’ve come.

Contribution opportunity: Release Engineering systems

Release Engineering runs a vast array of infrastructure and systems that do much of the continuous integration and releases for Mozilla. Many of our systems are small in their scope but must be able to scale up to support the incredible load that developers put on them. Other systems receive millions of requests every day from live Firefox, Fennec, and Thunderbird installations.

Do you want to help developer productivity or get releases into users’ hands more quickly and efficiently? Do you want to gain experience working on systems that must work at scale? If so, Release Engineering is a great place to look. Below are a few interesting bugs that could use some attention. If you’re interested in working on any of them I’m interested in mentoring you. You should be familiar with Python, but you don’t need to be an expert. Have a look below and contact me directly if anything interests you.

  • Partial update generation service: Arguably, updates are the most important part of the release process. Partial updates in particular help us keep a good user experience by reducing the amount of data a user needs to download, which means they update more quickly. We generate many of these already but creating this service would allow much more flexibility over what and when we generate partial updates. This project would involve writing the service from scratch, most likely in Python.
  • Update Balrog schema to support multiple partials: Balrog is the code name of our new update server (which I’ve previously blogged about). Its original design came about before we supported serving partial updates to users on multiple older versions of Firefox. In order to start using Balrog for Betas and Releases we need to add this feature. Balrog is written in Python and this will mostly involve server side changes to it.
  • Improve update verify output: “Update verify” is a very important test that we run as part of our release automation. Its job is to make sure that all users, regardless of where they’re coming from, end up in the same state after updating to the latest release. Its output currently consists of thousands and thousands of lines of text, with test results interspersed. This bug is about finding and implementing a way to make the output easier for a human to make sense of and parse upon failure. The update verify scripts are written in bash, but this could be implemented by modifying them or post-processing the output.
  • Store history of machine actions requested through API: We recently deployed a new system that helps us manage our thousands of build and test machines. It aims to be a single entry point for information gathering and common operations on them. Currently, the data in it is volatile — all history of operations is lost when the server is restarted. This bug will involve adding permanent storage (maybe SQL, maybe something else) to that server, which is written in Python.

Smaller & faster updates now accessible to more Firefox users

When a user receives an update to Firefox they get either a partial or complete. A complete is nearly identical to its associated installer, and can update any old Firefox version. A partial is a binary diff of a specific old version against a newer one and only compatible with that specific old version. The size difference between a partial and a complete can be huge. For example, the complete MAR for 14.0.1, en-US, win32 was 20MB. The partial from 13.0.1 was 7.4MB (even smaller on other platforms, where PGO doesn’t make diffing hard).

Until recently, we’ve only been able to produce partials against a single old version without a lot of extra time consuming and error prone manual work — meaning that a lot of users who could benefit from a partial weren’t receiving them.

With bug 773290 (multiple partial MAR support in release automation) resolved, we can now offer partial updates to many versions without the risk and time cost of doing them by hand. When we shipped 15.0 we offered partial updates to users on 14.0.1, 13.0.1, and 12.0 – which collectively represented just over 75% of our installed userbase. For future releases it’s possible we’ll offer partials to even more previous versions.

I’m sure some of you are asking why we can’t just do partial updates for ALL old releases. There are a couple of reasons for that. Most importantly, there are big diminishing returns on partial updates. The 13.0.1 -> 15.0 partial was 12MB, the 12.0 -> 15.0 partial was 14MB, and the complete was 20MB. The further you go back the less you gain from getting a partial. Secondly, computing partial updates is not computationally cheap. We ship 88 locales across 4 platforms – this works out to about 350 partials that need to be calculated for each old version. This can be made cheaper through caching and parallelization, but it still ends up adding about 45min to the running time of the release automation for every extra version we want partials for. In turn, this delays QA and other things in the critical path of shipping.
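
To make the diminishing returns concrete, here is a small back-of-the-envelope sketch in Python using only the numbers quoted above. The function names are mine, purely for illustration, and it assumes one partial per locale/platform pair for each old version.

```python
# Sizes in MB, as quoted in the post for the 15.0 release.
COMPLETE_MB = 20
PARTIAL_MB = {"13.0.1": 12, "12.0": 14}


def savings_percent(partial_mb, complete_mb=COMPLETE_MB):
    """Share of the complete download that a partial avoids."""
    return 100 * (complete_mb - partial_mb) / complete_mb


def partials_per_version(locales=88, platforms=4):
    """Assumes one partial per locale/platform pair for each old version."""
    return locales * platforms


for version, size in PARTIAL_MB.items():
    print(f"{version} -> 15.0 saves {savings_percent(size):.0f}% of the download")
print(partials_per_version(), "partials for every extra old version")
```

The older the starting version, the smaller the saving, while the cost of generating the partials stays the same for every version we add.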

I also want to point out that this work does NOT apply to Nightly or Aurora. We have no plans to offer multiple partial updates on those channels at this time. Due to their relatively low userbase and very high frequency of change (almost every 24h), the cost/benefit just doesn’t work. However, we will be looking at using this on Beta where the userbase is much larger and the rate of change slower (about once a week).

A huge thanks goes out to Rail Aliev and Nick Thomas, who helped work out the design, wrote some parts of it, and provided reviews. We couldn’t have had this ready for 15.0 without their help.

Release Automation – Part 3: Improvements & Optimizations (2009 to early 2011)

In my last post I talked about the major project of switching the Release Automation from Bootstrap driven by Buildbot to being directly implemented in Buildbot, and working out of Mercurial. After a 6 month break from automation work, there were a few spurts of development on the automation over the course of the next two years. Some of these were big new things, like Fennec automation, while others were deliberate attempts to improve the automation. This post will cover the most important changes that happened from late 2009 all the way through early 2011.

Late 2009 to mid 2010

Fennec Release Automation

In 2009 Mozilla began working on a version of Firefox for Maemo. Late in that year, we shipped 1.0rc1 with the Release Automation. Some people may be thinking “that doesn’t sound very hard, it’s just another platform right?”. Unfortunately, there’s a lot of hidden complexity with adding a new platform that doesn’t conform to long-held assumptions, like mobile. While the actual build process is fairly similar there’s a lot of pre-build, post-build, and other things that just aren’t the same. Fennec was the first product we supported that was built out of multiple source repositories, which not only caused problems for builds (and isn’t handled well by Buildbot), but affected how we tag repositories and generate source tarballs. L10n repacks were also completely different for Fennec: not only did we ship individual l10n builds for many locales, but we also shipped builds with multiple locales in them. Doing this meant build process changes as well as a new format to describe locales, revisions, and which types of repacks each one needed. All of this combined ended up being nearly a month of work (and many late nights, Aki tells me) to get up and running! This was the first product we’ve ever shipped that had automated releases from the start, which is a huge accomplishment for forward thinking & planning – something that we simply didn’t have time for in the past. It’s hard to determine how many hours of end2end time and # of manual touchpoints this saved since it was never manual work to begin with but there’s no doubt that we’re far better off with it than without.

Major Update

In the latter half of 2009 we started doing a lot of Major Updates. That is, offering 3.0.x users an update to a 3.5.x release. Behind the scenes, each Major Update offer had an end2end time of approximately 4 hours and at least 6 or 7 manual touchpoints for config file bumping, snippet generation, test snippet pushing, and verification of those snippets. If that wasn’t bad enough, a single mistake in the configuration file would cause us to have to restart the entire process! Automating this turned out to be one of the easier pieces of new automation because of how similar Major Updates were to the regular updates we already did with every release. When this relatively simple work was done, all of the manual touchpoints were gone completely and because these were now done automatically with a release instead of out of band they moved out of the critical path and therefore had no end2end time impact either! This is always the best kind of new automation =).

Bouncer Entries

In mid-2010 we automated a long standing annoyance: Bouncer entry creation. Like the Major Updates, this was something that was subject to manual error. More importantly, it was _damn_ annoying to do. Bouncer is the piece of software that powers download.mozilla.org, which redirects download requests to our mirror network. Each time we release we need to tell it where to find the files we ship. This translates to one entry for each installer, complete MAR, and partial MAR for each platform. Prior to this being fixed this was done mostly through copy and paste which has a massive margin for error. In the best case scenario this means we’ll get some 404s, which are easy to detect and fix. In the worst case we could point at the wrong release entirely, which is an error that may not get caught at all. Fixing this didn’t improve our end2end time at all but it did take away the most annoying manual touchpoint, which we were all very happy about.
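
To give a feel for why copy and paste was so error prone here, this is a minimal Python sketch of enumerating the entries a release needs: one per installer, complete MAR, and partial MAR, for each platform. The product-name format and the function name are made up for illustration; they are not Bouncer’s real naming scheme or API.

```python
from itertools import product


def bouncer_entries(version, old_versions, platforms):
    """Return one (product name, platform) pair per file type and platform.

    The name format is hypothetical, chosen only to show the shape of the data.
    """
    names = [f"Firefox-{version}", f"Firefox-{version}-Complete"]
    names += [f"Firefox-{version}-Partial-{old}" for old in old_versions]
    return list(product(names, platforms))


for name, platform in bouncer_entries("10.0.2", ["10.0.1"], ["win32", "linux", "osx"]):
    print(name, platform)
```

Generating the list from the version and platform set means a typo can only happen in one place, instead of once per entry.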

After this change the automation stayed relatively stable for the next 6 months, with only minor bugfixes happening.

Late 2010 to early 2011

At the end of 2010 and start of 2011 we began a huge round of upgrades and optimizations starting with upgrading to a new version of Buildbot. This work wasn’t shinyfun, but long overdue after the regular Continuous Integration infrastructure had upgraded many months prior.

After that was done some of us spent the next couple of months working hard on some new automation & improvements. This was one of the most exciting and active times for the Release Automation. We lowered end2end time by parallelizing some things, we took away many manual touchpoints with new pieces of automation, and we dramatically improved stability through intelligent retrying of failed operations. Also of note is that we went back to a model of having standalone scripts doing work and having Buildbot drive those, not unlike the Buildbot+Bootstrap era. This came about after having a lot of challenges implementing some things directly in Buildbot code, which makes it very difficult to make decisions at runtime, and the feeling that we didn’t want to tie ourselves to Buildbot forever.

Source Code Tagging

At the time, source repository tagging was one of the rougher parts of the automation. Not only did it often fail to push tags back to a repository due to losing a push race, but load issues caused us to get server side errors. For a period of time it was rare that a release *didn’t* have a problem in tagging. Moving the tagging code to an external script made fixing these errors a lot easier. At the same time, we were able to start building up some very useful libraries for working with Mercurial, retrying failed commands, and other things. Since these changes have landed it’s been very rare to have issues with tagging, and most of them have been regressions from recently landed things rather than long standing bugs with the tagging scripts.
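
The retrying idea is simple enough to sketch in a few lines of Python. This is not the library we actually wrote, just a minimal illustration; the function name and its parameters are assumptions.

```python
import time


def retry(action, attempts=5, sleep_seconds=30, retry_on=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    The last failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({error}); retrying")
            time.sleep(sleep_seconds)
```

Wrapping a flaky step (a push, a clone, an upload) in something like `retry(step)` turns a transient server error into a short delay instead of a failed release job.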

L10n Repacks

We used to have similar issues with our l10n repacking logic, too. Sometimes the jobs would die while trying to clone a repository or when trying to download or upload a build. Additionally, we used to use a different Buildbot job for each locale, which meant that we would redo steps like “clone/pull from source repository” for every single locale which was quite inefficient. As you may have guessed, we did a similar thing to fix these issues: moved them to a script! Because of the earlier work done with tagging we were able to get retrying of repository cloning for free, and easily add retrying of uploads/downloads. This script also introduced another new technique to the Release Automation: chunking (which was shamelessly ripped off of the Mochitest harness). Rather than have 1 Buildbot job for every single locale, the script knows how to compute the overall set of work for all locales and pick a chunk of it to work on.
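
Chunking can be sketched like this in Python. It is an illustration of the technique rather than the code from our script; the function name and the 1-based chunk numbering are assumptions.

```python
def get_chunk(items, total_chunks, this_chunk):
    """Return the this_chunk-th (1-based) of total_chunks near-equal slices.

    Earlier chunks absorb the remainder, so sizes differ by at most one.
    """
    if not 1 <= this_chunk <= total_chunks:
        raise ValueError("this_chunk must be between 1 and total_chunks")
    base, extra = divmod(len(items), total_chunks)
    start = (this_chunk - 1) * base + min(this_chunk - 1, extra)
    end = start + base + (1 if this_chunk <= extra else 0)
    return items[start:end]


locales = ["de", "es-ES", "fr", "it", "ja", "pl", "ru"]
for n in range(1, 4):
    print(n, get_chunk(locales, 3, n))
```

Each job only needs to be told “you are chunk 2 of 3”. Every job computes the same overall list, so the chunks never overlap and nothing is skipped.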

Automated E-mail

Every release requires a lot of coordination, particularly with Release Drivers and QA. We need to send mail notifications when the Release Automation starts, when each platform’s en-US build is complete, when each platform’s l10n repacks are complete, when updates are ready for testing, and some other events, too. It used to be that the Release Engineer responsible for the release would actively watch the jobs on a Buildbot display and send mail by hand as the jobs completed. Especially as we started doing releases more often, this became extremely tedious and distracting. It also caused artificial delays of up to 8 hours (in the worst case)! By automating these mails we massively reduced manual touchpoints, became more consistent with the messages we sent, allowed Release Engineers to more easily do other work mid-release, and in some extreme cases reduced end2end time of a release by multiple hours. Looking back on it this was one of the most important changes we’ve ever made, and certainly had the best cost/benefit ratio.

Pushing to Mirrors et al.

When we push a Firefox release out to the mirror network we get past the point of no return. Once it’s out there, we have no way to pull it back and no way to guarantee that we overwrite all of the files on all of the mirrors in a timely manner. If we find bugs past that point we have to increment the version number and start again. Because of that we do a full antivirus check and verification of all permissions prior to pushing (in addition to all of the testing that QA already does). These used to be done all by hand – a Release Engineer would log onto a machine at some point between builds being available and prior to pushing, run some commands, and wait. Besides the annoyance of doing it by hand, we would sometimes forget to do this in advance of the release. When that happened these things all of a sudden were in the critical path, and holding up the release. To address both of those issues these checks were automated and done immediately after all release files became available. At the same time we partly automated the mirror push itself. Pushing to mirrors involves running a command like:
rsync -av --exclude=*tests* --exclude=*crashreporter* --exclude=*.log --exclude=*.txt --exclude=*unsigned* --exclude=*update-backup* --exclude=*partner-repacks* --exclude=*.checksums --exclude=logs --exclude=jsshell* --exclude=*/*.asc /pub/mozilla.org/firefox/nightly/10.0.2-candidates/build1/ /pub/mozilla.org/firefox/releases/10.0.2/
With such a non-trivial thing being required every time it’s easy to make a mistake, so once again, automating is a clear way to reduce manual error.
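
As an illustration of what automating this looks like, here is a Python sketch that builds the rsync command above from a version and build number. The exclude patterns and paths are copied from that command; the function name and its parameters are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex

# Patterns taken from the rsync command quoted above.
EXCLUDES = [
    "*tests*", "*crashreporter*", "*.log", "*.txt", "*unsigned*",
    "*update-backup*", "*partner-repacks*", "*.checksums", "logs",
    "jsshell*", "*/*.asc",
]


def mirror_push_command(version, build_number, product="firefox"):
    """Build the rsync argument list that copies candidates to releases."""
    base = f"/pub/mozilla.org/{product}"
    source = f"{base}/nightly/{version}-candidates/build{build_number}/"
    destination = f"{base}/releases/{version}/"
    command = ["rsync", "-av"]
    command += [f"--exclude={pattern}" for pattern in EXCLUDES]
    return command + [source, destination]


print(shlex.join(mirror_push_command("10.0.2", 1)))
```

Passing the command around as a list (rather than one long string) also means the shell never gets a chance to expand the `*` patterns.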

Autosign

All of the Firefox builds that we distribute are signed in some manner. On Windows, we have Authenticode Signatures; for everything else we have detached GPG signatures. Signing our builds is a crucial part of the release process and right in the middle of the critical path. Because we ship Firefox in so many languages and on multiple platforms it can take a while to do all of our signing, which means it’s important to get it started as soon as possible. In the past, we had to wait for all builds & repacks to complete and then run a long series of manual commands on our signing machine to: download the builds, sign them, verify them, and upload the signed bits. This was OK for a while, but as we started shipping in more languages on more platforms it became horribly inefficient; downloading the builds alone started to take 30 minutes or more. And again, like many other things, there was lots of opportunity for manual error. Enter: Autosign. This relatively simple improvement adjusted the existing signing logic to be able to detect when it had all of the required bits to start signing. This meant that we could run the commands that would start signing as soon as the release began. The scripts continually download builds in a loop, in parallel with the rest of the automation running, which means we completely remove the “download builds” part of the signing process from the critical path. This also means that the Release Engineer doesn’t need to be at work or even awake when all of the builds & repacks complete. In some cases, just like automated e-mail, this can save multiple hours of end2end time.
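
The core of the idea, “keep downloading in a loop until everything we need is here”, can be sketched in Python as below. This is not the actual signing code: the function name, the callbacks, and the way builds are identified are all assumptions made for the example.

```python
import time


def wait_for_builds(expected, list_available, download, poll_seconds=60,
                    sleep=time.sleep):
    """Download builds as they appear; return once every expected one is local.

    expected is a set of build identifiers, list_available() returns the set
    currently published, and download(build) fetches one of them.
    """
    downloaded = set()
    while downloaded != expected:
        new_builds = (list_available() & expected) - downloaded
        for build in sorted(new_builds):
            download(build)
            downloaded.add(build)
        if downloaded != expected:
            sleep(poll_seconds)
    return downloaded
```

Because the downloads happen while the builds and repacks are still running, by the time the last build lands there is only one file left to fetch before signing can start.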

Summary

The combination of all of the changes above took the automation from a moderately fast system that worked most of the time to a very speedy system that rarely fails. Nearly everyone in Release Engineering had a hand in this, and most of the changes were done over a two month period!

Incredibly, there was still more we found to improve in the following year, which I’ll talk about in Part 4!

Release Automation – Part 2: Mercurial-based, v1

Around the start of 2008 Mozilla moved Firefox and Gecko development from CVS to Mercurial, with Firefox 3.5 (née 3.1) as the first release out of the new repository. In addition to that, the underlying build infrastructure had switched from being Tinderbox driven, to being Buildbot driven – which made some of the existing release automation useless. In mid-2008 we started planning to port, rework, and update the release automation for this new environment. The 2008 Firefox Summit conveniently happened right around this time, so we took that opportunity to gather a quorum on the subject and go over all the plans in detail. By the end of the night (and end of the beer, if I recall correctly), we had discussed everything to death and had a tracking bug.

This version of the automation struck a balance between improving the overall design of the system and simply doing straight porting work. The plain porting isn’t very interesting, so I’ll be mostly focusing on the improvements we made in this post.

One of the bigger optimizations we made was to generate files in their final location at build time. In the Bootstrap days we uploaded files to flat directories with long filenames, and then re-arranged them into their final layout later on in the process. With this change made our candidates directories looked a lot more like the associated release directory. This may not sound like a huge change but it cut our disk space usage per release in half or more, shaved over an hour off the end2end time of the release, and let us put our release file naming logic into the build system, where it more rightly belonged. It also allowed us to make the next optimization: combining the signing processes.

In the Bootstrap and pre-Bootstrap worlds we had two separate signing processes: one to sign the internal guts of Firefox win32 builds (firefox.exe, xul.dll, et al.) and one to sign the Firefox installers themselves. Early on, we signed the internal bits and handed them off to QA. Closer to release time, we signed the installers themselves and generated GPG signatures for all files. The only reason I can think of why we would do this is to keep signed installers out of public directories until we’re sure we’ll be releasing them. This isn’t without its drawbacks though. Leaving this until later in the process added unnecessary manual touchpoints, put non-trivial work late in the critical path, and worst of all: It meant QA did not test the exact bits that we shipped to users! (We actually managed to ship unsigned installers once, which isn’t possible anymore.) Improving this only required a small rework of our existing signing scripts (and lots of testing, of course!) but it took another 1-2h off of our end2end time and removed another manual touchpoint.

It’s also worth noting that merely by switching to Mercurial we saved over half an hour in end2end time in tagging. In CVS, we had to create a branch and tag thousands and thousands of files with multiple tags, which takes a very long time. In Mercurial, we have to clone a repository, which takes some time, but the tagging itself is near-instant.

In addition to the optimizations noted above, tons of work was done porting the existing automation. Many things had to be pulled out of Bootstrap and put into their own scripts to make them usable by both versions of the automation; en-US builds and l10n repacks had to be reimplemented entirely in Buildbot; and some other things that couldn’t be pulled out of Bootstrap had to be reimplemented as well. It was a very large undertaking that was primarily worked on by Nick Thomas, Coop, and myself and took months to complete.

Firefox 3.1b3 was the first fully automated release with this automation. By the time we worked out most of the kinks we were at end2end time of 8-10h and about 12 manual touchpoints.

Next up: Various improvements & optimizations (not as boring as it sounds, I promise!)

Release Automation – Part 1: Bootstrap

One of the first tasks I had as a full-time employee of Mozilla was getting the Bootstrap Release framework working with Firefox 3.0 Beta releases. Now, just over 4 years later, our Release Automation has changed dramatically in many ways: primary language, supported platforms, scope and extent, reliability, and versatility. I thought it might be interesting to trace the path from there to here, and talk about what’s in store for the future, too. Throughout all of this work there have been two overarching goals: 1) Lower the time it takes to go from “go to build” to “updates available for testing” – which we call “end2end time”, and 2) Reduce the number of machines we have to log into, commands we have to run, and active time we have to spend on a release – known as “manual touchpoints”. I’ll be referencing these a lot throughout this series.

This post will talk about what I know of Bootstrap and my work porting it to Firefox 3.0.

In its earliest form Bootstrap was a simple scripted version of much of the previously manual release process. The processes for tagging VCS repositories, creating deliverables (source packages, en-US and localized builds, updates), and some verifications were encapsulated into its scripts. This was a big improvement over the 100% manual, cut+paste-from-a-wiki, process. Instead of logging into many machines and running many commands, the release engineer had to log in to many machines and run a few, very simple commands. The very first release that was Bootstrap-aided was Firefox 1.5.0.9, built on December 6th, 2006. This was before my time, but a former release engineer, Rob Helmer, told me that the end2end time back then could be multiple days, and countless touchpoints.

Over time, more parts of the release process were automated with Bootstrap, further reducing the burden on the release engineer. Even with these big improvements some classes of things were still not codified: which machines to run which commands on, when and in what order to run things, who to notify about what. Enter: Buildbot. Integrating Bootstrap into Buildbot was the next logical step in the process. It would handle scheduling and status, while Bootstrap would remain responsible for all of the implementation. With this, the release engineer only had to log in to a few machines and run a few, very simple commands. Another big improvement! The first release to benefit from this was Firefox 2.0.0.8, built on October 10th, 2007. This work was largely done by Rob Helmer.

Around this time we were gearing up to start shipping the first Firefox 3.0 Beta release and had never tested Bootstrap against that development branch. I was tasked with making whatever changes were necessary to Bootstrap and our Buildbot to make it work. The Buildbot side was largely simple, because of it being at such a high abstraction layer, but back in these days we still had single purpose Buildbot masters, so it involved adding several hundred lines of config code.

The Bootstrap side was far more interesting. Until this point, there were a lot of built-in assumptions based on what the 1.8 branch looked like, including:

  • Releases are done from CVS branches (explicitly _not_ trunk)
  • Windows build machines run Cygwin
  • Linux packages are in .gz format
  • The crash reporting system Talkback is always shipped

By themselves, none of these things are too challenging to deal with, but as a very new hire, the combination took me about a month to find solutions to and fully test, with many rounds of feedback and guidance along the way. With all of that done and landed, we managed to use the new automation to build Firefox 3.0b2 on December 10, 2007. At this point, the end2end time was around 24h and there were about 20 manual touchpoints.

Over the next 8 months or so there were a few major improvements of note. Firstly, Nick Thomas fixed bug 409394 (Support for long version names), which allowed us to start shipping releases with nicer looking filenames like “Firefox Setup 3.0 Beta 4”. Not a crucial thing, but much nicer from the user perspective. Bug 422235 (enable fast patcher for release automation) was a massive improvement in update generation, written by schrep. With this work, we went from taking 6-8 hours to generate updates, down to ~1h, an incredible savings in time. Finally, bug 428063 (Support major releases & quit using rc in overloaded ways) (also fixed by Nick) enabled us to build RCs with Bootstrap. While it may sound simple, there are a lot of things in release automation that depend on filename, and catching them all can be difficult. As well as making it possible to build these, this bug also renamed the internal “rc” notion to “build”, to avoid situations where we’d have things like “3.0 RC1 rc1”, which was utterly confusing.


So, in the early days there were tons of improvements in quick succession: Bootstrap itself sped things up and lowered the possibility of error through reducing manual touchpoints. Buildbot + Bootstrap did so again, through the same methods. We also had pure speed-ups through things such as fast patcher. Having these things allowed us to maintain the 2.0.0.x and 3.0.x branches more easily, and get chemspill releases out quickly and simultaneously. All of this work had to be done incrementally too, because we had to continue shipping releases while the work was happening. It’s hard to find good data for releases done with this version of the automation, but I guesstimate that the end2end time was around 12-14 hours and the number of manual touchpoints was still around 20 for a release without major issues.

Next up….Release Automation on Mercurial, v1.