Code signing coming to Firefox Mac builds

A few weeks ago a new Developer Preview of OS X 10.8 was released, and it was discovered that, as things stand now, Firefox will not run on it. With the current default settings, 10.8 will not allow any software to run unless it's signed with an Apple Developer ID (essentially, a certificate issued by a particular Apple Root CA). We don't know exactly when 10.8 will be released to the public, but some have speculated that it could be as early as the week of June 11th at WWDC 2012. We must have a signed and released Firefox out there before the general public starts upgrading, and we've been working hard to make that happen as soon as possible. This post will give a short history of Mac signing at Mozilla and talk about our timeline for enabling it.

Background

Code signing of Mac builds has been on our radar for a long time. Bug 400296 was originally filed in 2007. In late 2010 Syed Albiz did a ton of great work figuring out the Apple tools and how we can integrate them into our automation. That work didn't quite get finished before his internship was completed and the bug stagnated for some time afterwards. At the start of this year there was renewed energy when Erick Dransch picked up the bug. We attempted to land his work and enable signing on nightlies in mid-April, but that ended up bouncing due to some conflicts with our upgrade to 10.7-based build machines. Erick's internship expired before everything could be fixed up, and the bug fell to me.

After gaining access to Mozilla's Apple Developer account on Monday there was a lot of early iteration before we got to the point where we could sign a build in a way that Mountain Lion liked. There are multiple certificate types that one can get from Apple ("Development Certificate", "Mac App Certificate", "Developer ID Certificate") and multiple versions of OS X and Xcode (each with their own quirks) that one can sign with. Mostly thanks to Steven Michaud's knowledge and assistance, we figured out exactly what combination of these we'll need to use to have signed Firefox builds that work everywhere.

Where we're at now

At this point we've got all the tools we need to sign all Mac Firefox builds. The only remaining blocker is figuring out access restrictions to our Apple Developer account, so that we can generate our final Developer ID certificates.

Endgoal

As with Windows Authenticode signing, we will have 3 different certificates for different types of Firefox builds. Dep and try builds are at the lowest level of trust and have no regular users, so they will be signed with a self-signed certificate. This means that they will not run on 10.8 unless the user has allowed "applications downloaded from anywhere" to run (which is not the default). Nightly and Aurora are at an elevated level of trust and have a userbase; these will be signed with their own Developer ID certificate. Finally, Beta and Release builds are at the highest level of trust and oversight, and represent the majority of our users. They will be signed with a separate Developer ID certificate. From a user standpoint, Nightly, Aurora, Beta, and Release will all look the same, but using separate certificates gives us some degree of isolation in terms of certificate revocation.
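
For the curious, here's a minimal sketch of what the signing and verification steps look like. The bundle path and identity name below are made up for illustration; the real automation is more involved:

import subprocess

APP = "Nightly.app"  # hypothetical path to an unsigned build
IDENTITY = "Developer ID Application: Example"  # hypothetical certificate name

# Sign the bundle with the Developer ID certificate from the default keychain.
subprocess.check_call(["codesign", "--sign", IDENTITY, "--force", APP])

# Ask Gatekeeper whether it would allow this bundle to launch under the
# default "Mac App Store and identified developers" setting.
subprocess.check_call(["spctl", "--assess", "--type", "execute", "--verbose", APP])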

Timeline

I intend to have dep and try builds signed by the end of the week. After we figure out the access restrictions to our Developer Account we will turn on signing of Nightly builds, hopefully early/mid next week. After letting those settle for a day or two we will turn on signing of Aurora and Beta builds, hopefully by the end of next week.


If you're interested in the technical details of signing Mac builds, Erick wrote an excellent blog post detailing the trials and tribulations of writing tools around them.

Using Authenticode Code Signing Certificates with OS X's Signing Tools

Our intern Erick has been doing some great work reviving, polishing and finalizing the patches that will allow us to start signing OS X builds (more to come on that in his blog!). When we do start signing them we're planning to use our existing set of code signing certificates rather than buy new ones. I thought converting them would be a simple task, so I set off to convert our internal, self-generated ones. After hours and hours of head scratching and frustration I learned that some versions of Microsoft's "makecert" tool are broken and generate invalid PKCS#7 certs that openssl can't cope with properly. From the OpenSSL PKCS#12 FAQ:

Q. What are SPC files?

A. They are simply DER encoded PKCS#7 files containing the certificates. Well they are in the newer versions of the tools. The older versions used an invalid PKCS#7 format.

All of my attempts ended with a PKCS#12 certificate that Apple's codesign tool claimed couldn't be used for code signing.

After finding that FAQ, I decided to try converting our Nightly code signing certificate instead. Following the great instructions found on Marc Liyanage's blog I managed to convert the certificate, import it into a Keychain, and successfully sign something! Here's the shortened version of what I did. Note that it requires the PVK tool found here:

~/pvk.exe -in Nightly.pvk -out Nightly.key.pem

openssl pkcs7 -inform der -in Nightly.spc -print_certs -out Nightly.cert.pem

openssl pkcs12 -export -inkey Nightly.key.pem -in Nightly.cert.pem -out Nightly.p12

I hope this helps someone else avoid the same frustration!

release-automation - Part 3: Improvements & Optimizations (2009 to early 2011)

In my last post I talked about the major project of switching the release-automation from Bootstrap driven by Buildbot to being directly implemented in Buildbot, and working out of Mercurial. After a 6-month break from automation work, there were a few spurts of development on the automation over the course of the next two years. Some of these were big new things, like Fennec automation, while others were deliberate attempts to improve the existing automation. This post will cover the most important changes that happened from late 2009 all the way through early 2011.

Late 2009 to mid 2010

Fennec release-automation

In 2009 Mozilla began working on a version of Firefox for Maemo. Late in that year, we shipped 1.0rc1 with the release-automation. Some people may be thinking "that doesn't sound very hard, it's just another platform right?". Unfortunately, there's a lot of hidden complexity in adding a new platform, like mobile, that doesn't conform to long-held assumptions. While the actual build process is fairly similar there's a lot of pre-build, post-build, and other things that just aren't the same. Fennec was the first product we supported that was built out of multiple source repositories, which not only caused problems for builds (and isn't handled well by Buildbot), but affected how we tag repositories and generate source tarballs. L10n repacks were also completely different for Fennec: not only did we ship individual l10n builds for many locales, but we also shipped builds with multiple locales in them. Doing this meant build process changes as well as a new format to describe locales, revisions, and which types of repacks each one needed. All of this combined ended up being nearly a month of work (and many late nights, Aki tells me) to get up and running! This was the first product we've ever shipped that had automated releases from the start, which is a huge accomplishment for forward thinking & planning - something that we simply didn't have time for in the past. It's hard to determine how many hours of end2end time and # of manual touchpoints this saved since it was never manual work to begin with, but there's no doubt that we're far better off with it than without.
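
To give a flavour of that last point, the per-locale information we needed to track amounted to something like this. The structure and values here are invented for illustration, not the exact format we used:

# Hypothetical sketch: each locale needs a source revision and a list of
# which repack types (standalone single-locale, multi-locale) it ships in.
locales = {
    "de": {"revision": "abc123def456", "repacks": ["single", "multi"]},
    "fr": {"revision": "0123456789ab", "repacks": ["single"]},
    "pl": {"revision": "fedcba987654", "repacks": ["multi"]},
}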

Major Update

In the latter half of 2009 we started doing a lot of Major Updates. That is, offering 3.0.x users an update to a 3.5.x release. Behind the scenes, each Major Update offer had an end2end time of approximately 4 hours and at least 6 or 7 manual touchpoints: config file bumping, snippet generation, test snippet pushing, and verification of those snippets. If that wasn't bad enough, a single mistake in the configuration file would cause us to have to restart the entire process! Automating this turned out to be one of the easier pieces of new automation because of how similar Major Updates were to the regular updates we already did with every release. When this relatively simple work was done, all of the manual touchpoints were gone completely, and because these were now done automatically with a release instead of out of band they moved out of the critical path and therefore had no end2end time impact either! This is always the best kind of new automation =).

Bouncer Entries

In mid-2010 we automated a long standing annoyance: Bouncer entry creation. Like the Major Updates, this was something that was subject to manual error. More importantly, it was _damn_ annoying to do. Bouncer is the piece of software that powers download.mozilla.org, which redirects download requests to our mirror network. Each time we release we need to tell it where to find the files we ship. This translates to one entry for each installer, complete MAR, and partial MAR for each platform. Prior to this fix, entries were created mostly through copy and paste, which has a massive margin for error. In the best case scenario this means we'll get some 404s, which are easy to detect and fix. In the worst case we could point at the wrong release entirely, which is an error that may not get caught at all. Fixing this didn't improve our end2end time at all but it did take away the most annoying manual touchpoint, which we were all very happy about. After this change the automation stayed relatively stable for the next 6 months, with only minor bugfixes happening.

Late 2010 to early 2011

At the end of 2010 and start of 2011 we began a huge round of upgrades and optimizations, starting with upgrading to a new version of Buildbot. This work wasn't shiny or fun, but it was long overdue after the regular Continuous Integration infrastructure had been upgraded many months prior.

After that was done some of us spent the next couple of months working hard on some new automation & improvements. This was one of the most exciting and active times for the release-automation. We lowered end2end time by parallelizing some things, we took away many manual touchpoints with new pieces of automation, and we dramatically improved stability through intelligent retrying of failed operations. Also of note is that we went back to a model of having standalone scripts do the work, with Buildbot driving them - not unlike the Buildbot+Bootstrap era. This came about after a lot of challenges implementing some things directly in Buildbot code (which makes it very difficult to make decisions at runtime), and the feeling that we didn't want to tie ourselves to Buildbot forever.

Source Code Tagging

At the time, source repository tagging was one of the rougher parts of the automation. Not only did it often fail to push tags back to a repository due to losing a push race, but load issues caused us to get server-side errors. For a period of time it was rare that a release *didn't* have a problem in tagging. Moving the tagging code to an external script made fixing these errors a lot easier. At the same time, we were able to start building up some very useful libraries for working with Mercurial, retrying failed commands, and other things. Since these changes landed it's been very rare to have issues with tagging, and most of the issues we have seen were regressions from recently landed things rather than long-standing bugs in the tagging scripts.
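
The retry logic is simple, but it pays for itself quickly. A minimal sketch of the kind of helper we ended up with - not the actual library, and the repository URL is just an example:

import subprocess
import time

def retry(cmd, attempts=5, sleeptime=30):
    """Run cmd, retrying on failure before giving up for good."""
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.check_call(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(sleeptime)

# Transient failures (like an overloaded server) become a sleep and
# another attempt, rather than a failed release job.
retry(["hg", "pull", "-u"])
retry(["hg", "push", "ssh://hg.mozilla.org/users/example/repo"])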

L10n Repacks

We used to have similar issues with our l10n repacking logic, too. Sometimes the jobs would die while trying to clone a repository, or when trying to download or upload a build. Additionally, we used to use a different Buildbot job for each locale, which meant that we would redo steps like "clone/pull from source repository" for every single locale - quite inefficient. As you may have guessed, we did a similar thing to fix these issues: moved them to a script! Because of the earlier work done with tagging we got retrying of repository cloning for free, and could easily add retrying of uploads/downloads. This script also introduced another new technique to the release-automation: chunking (which was shamelessly ripped off of the Mochitest harness). Rather than have one Buildbot job for every single locale, the script knows how to compute the overall set of work for all locales and pick a chunk of it to work on.
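
Chunking itself is only a few lines. A sketch of the idea (the locale list is just an example):

def get_chunk(items, total_chunks, this_chunk):
    """Return the slice of items that chunk this_chunk (1-based) handles,
    spreading any remainder across the earlier chunks."""
    per_chunk, leftover = divmod(len(items), total_chunks)
    start = per_chunk * (this_chunk - 1) + min(this_chunk - 1, leftover)
    end = start + per_chunk + (1 if this_chunk <= leftover else 0)
    return items[start:end]

# Every job computes the full set of work the same way...
locales = ["ar", "de", "es-ES", "fr", "ja", "pl", "ru", "zh-CN"]
# ...but job 2 of 3 clones the repository once and repacks only its share.
print(get_chunk(locales, 3, 2))  # ['fr', 'ja', 'pl']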

Automated E-mail

Every release requires a lot of coordination, particularly with Release Drivers and QA. We need to send mail notifications when the release-automation starts, when each platform's en-US build is complete, when each platform's l10n repacks are complete, when updates are ready for testing, and at some other events, too. It used to be that the Release Engineer responsible for the release would actively watch the jobs on a Buildbot display and send mail by hand as the jobs completed. Especially as we started doing releases more often, this became extremely tedious and distracting. It also caused artificial delays of up to 8 hours in the worst case! By automating these mails we massively reduced manual touchpoints, became more consistent with the messages we sent, allowed Release Engineers to more easily do other work mid-release, and in some extreme cases reduced the end2end time of a release by multiple hours. Looking back on it, this was one of the most important changes we've ever made, and certainly the one with the best cost/benefit ratio.
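
The mails themselves are nothing fancy. A stripped-down sketch of the idea - the addresses and SMTP host are placeholders:

import smtplib
from email.mime.text import MIMEText

def notify(subject, body):
    """Send a status mail as soon as the automation hits a milestone."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = "release-automation@example.com"  # placeholder
    msg["To"] = "release-drivers@example.com"       # placeholder
    server = smtplib.SMTP("localhost")
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()

notify("[release] win32 en-US build complete",
       "The win32 en-US build is available in the candidates directory.")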

Pushing to Mirrors et al.

When we push a Firefox release out to the mirror network we pass the point of no return. Once it's out there, we have no way to pull it back and no way to guarantee that we overwrite all of the files on all of the mirrors in a timely manner. If we find bugs past that point we have to increment the version number and start again. Because of that we do a full antivirus check and verification of all permissions prior to pushing (in addition to all of the testing that QA already does). These checks used to be done entirely by hand - a Release Engineer would log onto a machine at some point between builds being available and the push to mirrors, run some commands, and wait. Besides the annoyance of doing it by hand, we would sometimes forget to do it in advance of the release. When that happened these checks were suddenly in the critical path, holding up the release. To address both of those issues these checks were automated and run immediately after all release files become available. At the same time we partly automated the mirror push itself. Pushing to mirrors involves running a command like:

rsync -av --exclude=tests --exclude=crashreporter --exclude=*.log --exclude=*.txt --exclude=unsigned --exclude=update-backup --exclude=partner-repacks --exclude=*.checksums --exclude=logs --exclude=jsshell --exclude=*.asc /pub/mozilla.org/firefox/nightly/10.0.2-candidates/build1/ /pub/mozilla.org/firefox/releases/10.0.2/

With such a non-trivial command being required every time, it's easy to make a mistake - so once again, automating it is a clear way to reduce manual error.

Autosign

All of the Firefox builds that we distribute are signed in some manner. On Windows, we have Authenticode signatures; for everything else we have detached GPG signatures. Signing our builds is a crucial part of the release process and right in the middle of the critical path. Because we ship Firefox in so many languages and on multiple platforms it can take a while to do all of our signing, which means it's important to get it started as soon as possible. In the past, we had to wait for all builds & repacks to complete and then run a long series of manual commands on our signing machine to download the builds, sign them, verify them, and upload the signed bits. This was OK for a while, but as we started shipping in more languages on more platforms it became horribly inefficient; downloading the builds alone started to take 30 minutes or more. And again, like many other things, there was lots of opportunity for manual error. Enter: Autosign. This relatively simple improvement adjusted the existing signing logic to detect when it had all of the required bits to start signing. This meant that we could kick off the signing process as soon as the release began. The scripts continually download builds in a loop, in parallel with the rest of the automation running, which means we completely remove the "download builds" part of the signing process from the critical path. It also means that the Release Engineer doesn't need to be at work, or even awake, when all of the builds & repacks complete. In some cases, just like automated e-mail, this can save multiple hours of end2end time.
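
The heart of Autosign is little more than a polling loop: keep downloading until every expected deliverable is local, then start signing. A sketch, with the file list and helpers as stand-ins for the real automation:

import time

def autosign(expected_files, download, sign):
    """Start signing the moment the last deliverable arrives."""
    remaining = set(expected_files)
    while remaining:
        # Grab whatever has appeared since the last pass; this runs in
        # parallel with builds and repacks still going elsewhere.
        for f in sorted(remaining):
            if download(f):
                remaining.discard(f)
        if remaining:
            time.sleep(60)
    # Everything is already local, so only the signing itself remains
    # in the critical path.
    sign()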

Summary

The combination of all of the changes above took the automation from a moderately fast system that worked most of the time to a very speedy system that rarely fails. Nearly everyone in Release Engineering had a hand in this, and most of the changes were made over a two-month period!

Incredibly, there was still more we found to improve in the following year, which I'll talk about in Part 4!

release-automation - Part 2: Mercurial-based, v1

Around the start of 2008 Mozilla moved Firefox and Gecko development from CVS to Mercurial, with Firefox 3.5 (née 3.1) as the first release out of the new repository. In addition to that, the underlying build infrastructure had switched from being Tinderbox driven to being Buildbot driven - which made some of the existing release automation useless. In mid-2008 we started planning to port, rework, and update the release automation for this new environment. The 2008 Firefox Summit conveniently happened right around this time, so we took that opportunity to gather a quorum on the subject and go over all the plans in detail. By the end of the night (and the end of the beer, if I recall correctly), we had discussed everything to death and filed a tracking bug.

This version of the automation struck a balance between improving the overall design of the system and simply doing straight porting work. The plain porting isn't very interesting, so I'll be mostly focusing on the improvements we made in this post.

One of the bigger optimizations we made was to generate files in their final location at build time. In the Bootstrap days we uploaded files to flat directories with long filenames, and then re-arranged them into their final layout later on in the process. With this change, our candidates directories looked a lot more like the associated release directory. This may not sound like a huge change but it cut our disk space usage per release in half or more, shaved over an hour off the end2end time of a release, and let us put our release file naming logic into the build system, where it more rightly belonged. It also allowed us to make the next optimization: combining the signing processes.

In the Bootstrap and pre-Bootstrap worlds we had two separate signing processes: one to sign the internal guts of Firefox win32 builds (firefox.exe, xul.dll, et al.) and one to sign the Firefox installers themselves. Early on, we signed the internal bits and handed them off to QA. Closer to release time, we signed the installers themselves and generated GPG signatures for all files. The only reason I can think of for doing this is to keep signed installers out of public directories until we're sure we'll be releasing them. This isn't without its drawbacks though. Leaving it until later in the process added unnecessary manual touchpoints, put non-trivial work late in the critical path, and worst of all: it meant QA did not test the exact bits that we shipped to users! (We actually managed to ship unsigned installers once, which isn't possible anymore.) Improving this only required a small rework of our existing signing scripts (and lots of testing, of course!) but it took another 1-2h off of our end2end time and removed another manual touchpoint.

It's also worth noting that merely by switching to Mercurial we saved over half an hour of end2end time in tagging. In CVS, we had to create a branch and tag thousands and thousands of files with multiple tags, which takes a very long time. In Mercurial, we have to clone a repository, which takes some time, but the tagging itself is near-instant.
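
For example, tagging an entire Mercurial repository is a single cheap operation (the revision and tag name here are invented for illustration):

import subprocess

# Tag the whole repository at one revision; unlike CVS, this doesn't
# involve touching thousands of individual files.
subprocess.check_call(["hg", "tag", "-r", "tip", "EXAMPLE_3_5_RELEASE"])
subprocess.check_call(["hg", "push"])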

In addition to the optimizations noted above, tons of work was done porting the existing automation. Many things had to be pulled out of Bootstrap and put into their own scripts to make them usable by both versions of the automation; en-US builds and l10n repacks had to be reimplemented entirely in Buildbot; and some other things that couldn't be pulled out of Bootstrap had to be reimplemented as well. It was a very large undertaking that was primarily worked on by Nick Thomas, Coop, and myself and took months to complete.

Firefox 3.1b3 was the first fully automated release with this automation. By the time we worked out most of the kinks we were at an end2end time of 8-10h and about 12 manual touchpoints.

Next up: Various improvements & optimizations (not as boring as it sounds, I promise!)

release-automation - Part 1: Bootstrap

One of the first tasks I had as a full-time employee of Mozilla was getting the Bootstrap release framework working with Firefox 3.0 Beta releases. Now, just over 4 years later, our release-automation has changed dramatically in many ways: primary language, supported platforms, scope and extent, reliability, and versatility. I thought it might be interesting to trace the path from there to here, and talk about what's in store for the future, too. Throughout all of this work there have been two overarching goals: 1) Lower the time it takes to go from "go to build" to "updates available for testing" - which we call "end2end time", and 2) Reduce the number of machines we have to log into, commands we have to run, and active time we have to spend on a release - known as "manual touchpoints". I'll be referencing these a lot throughout this series.

This post will talk about what I know of Bootstrap and my work porting it to Firefox 3.0.

In its earliest form Bootstrap was a simple scripted version of much of the previously manual release process. The processes for tagging VCS repositories, creating deliverables (source packages, en-US and localized builds, updates), and some verifications were encapsulated into its scripts. This was a big improvement over the 100% manual, cut+paste-from-a-wiki, process. Instead of logging into many machines and running many commands, the release engineer had to log in to many machines and run a few, very simple commands. The very first release that was Bootstrap-aided was Firefox 1.5.0.9, built on December 6th, 2006. This was before my time, but a former release engineer, Rob Helmer, told me that the end2end time back then could be multiple days, and countless touchpoints.

Over time, more parts of the release process were automated with Bootstrap, further reducing the burden on the release engineer. Even with these big improvements some classes of things were still not codified: which machines to run which commands on, when and in what order to run things, who to notify about what. Enter: Buildbot. Integrating Bootstrap into Buildbot was the next logical step in the process. It would handle scheduling and status, while Bootstrap would remain responsible for all of the implementation. With this, the release engineer only had to log in to a few machines and run a few, very simple commands. Another big improvement! The first release to benefit from this was Firefox 2.0.0.8, built on October 10th, 2007. This work was largely done by Rob Helmer.

Around this time we were gearing up to ship the first Firefox 3.0 Beta release and had never tested Bootstrap against that development branch. I was tasked with making whatever changes were necessary to Bootstrap and our Buildbot configs to make it work. The Buildbot side was largely simple, because it sits at such a high abstraction layer, but back in these days we still had single-purpose Buildbot masters, so it involved adding several hundred lines of config code.

The Bootstrap side was far more interesting. Until this point, there were a lot of built-in assumptions based on what the 1.8 branch looked like, including:

  • Releases are done from CVS branches (explicitly _not_ trunk)
  • Windows build machines run Cygwin
  • Linux packages are in .gz format
  • The crash reporting system Talkback is always shipped

By themselves, none of these things are too challenging to deal with, but as a very new hire, the combination took me about a month to find solutions to and fully test, with many rounds of feedback and guidance along the way. With all of that done and landed, we managed to use the new automation to build Firefox 3.0b2 on December 10, 2007. At this point, the end2end time was around 24h and there were about 20 manual touchpoints.

Over the next 8 months or so there were a few major improvements of note. Firstly, Nick Thomas fixed bug 409394 (Support for long version names), which allowed us to start shipping releases with nicer looking filenames like "Firefox Setup 3.0 Beta 4". Not a crucial thing, but much nicer from the user perspective. Bug 422235 (enable fast patcher for release automation), fixed by schrep, was a massive improvement in update generation. With this work, we went from taking 6-8 hours to generate updates down to ~1h -- an incredible savings in time. Finally, bug 428063 (Support major releases & quit using rc in overloaded ways) (also fixed by Nick) enabled us to build RCs with Bootstrap. While it may sound simple, there are a lot of things in release automation that depend on filenames, and catching them all can be difficult. As well as making it possible to build these, this bug also renamed the internal "rc" notion to "build", to avoid situations where we'd have things like "3.0 RC1 rc1", which was utterly confusing.


So, in the early days there were tons of improvements in quick succession: Bootstrap itself sped things up and lowered the possibility of error by reducing manual touchpoints. Buildbot + Bootstrap did so again, through the same methods. We also had pure speed-ups through things such as fast patcher. Having these things allowed us to maintain the 2.0.0.x and 3.0.x branches much more easily, and get chemspill releases out quickly and simultaneously. All of this work had to be done incrementally too, because we had to continue shipping releases while the work was happening. It's hard to find good data for releases done with this version of the automation, but I guesstimate that the end2end time was around 12-14 hours and the number of manual touchpoints was still around 20 for a release without major issues.

Next up....release-automation on Mercurial, v1.

Removed symlinks for dead branches on FTP (Firefox only)

Since the Firefox 2.0 days we've had "latest-X.Y" symlinks on FTP for all major versions of Firefox. With rapid release, this has quickly caused an explosion in the number of them, cluttering things up. In bug 689936 I removed all of the ones for dead branches (2.0, 3.0, 3.5), as well as all of the ones for rapid releases (4.0, 5.0, 6.0, 7.0). From now on, there will be no new branch-based symlinks, simply a "latest" symlink that points to the latest rapid release.

New tests coming to opt builds and l10n repacks

For a couple of years now we've been building Firefox release builds using slightly different packaging targets than nightly builds. This has streamlined our release automation in some ways, but has had the unfortunate side effect of release build packaging targets being largely untested. This week, we will start correcting that. When bug 600838 lands we will start testing the release build packaging code ("MOZ_PKG_PRETTYNAMES"). These packages will not be uploaded anywhere, but any build that fails in one of these targets will constitute a test failure, and turn the overall build orange. By doing so, we can ensure that any bustages to them will be caught at commit time, rather than during a release.

As of now, these tests are running on Linux (32 and 64 bit) and Windows opt en-US builds only, across all branches (including Try). Sometime next week these tests will be turned on for the remaining opt builds and l10n repacks, except the Mac ones on 1.9.1 and 1.9.2, which fail for unknown reasons, and aren't worth debugging due to their limited life going forward.

New tests coming to Linux and Windows opt builds

For a couple of years now we've been building Firefox release builds using slightly different packaging targets than nightly builds. This has streamlined our release automation in some ways, but has had the unfortunate side effect of release build packaging targets being largely untested. This week, we will start correcting that. When bug 600832 lands we will start testing the release build packaging code ("MOZ_PKG_PRETTYNAMES"). These packages will not be uploaded anywhere, but any build that fails in one of these targets will constitute a test failure, and turn the overall build orange. By doing so, we can ensure that any bustages to them will be caught at commit time, rather than during a release.
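
Concretely, the new step amounts to re-running packaging the way a release build does, roughly like this (the objdir name is an example):

import subprocess

# Re-run the packaging targets with release-style ("pretty") file naming;
# a failure here turns the build orange like any other test failure.
subprocess.check_call(["make", "-C", "obj-firefox", "package",
                       "MOZ_PKG_PRETTYNAMES=1"])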

For now, these tests will be run on all Linux (32 and 64 bit) and Windows en-US opt builds, including nightlies, but will make their way to Mac builds and l10n repacks shortly.

Purple is a fruit (...and also a colour on Tinderbox!)

Today I successfully landed the first part of bug 505512, which lays the groundwork for catching all sorts of build infrastructure problems and turning builds purple, instead of red. As part of this initial work we'll now be catching most problems when cloning Mercurial repositories, turning those builds purple, and automatically retrying them.

In the next week or two I'm going to add similar behaviour for at least the following:

  • Graph Server post failures
  • Slave disconnections
  • Sendchange failures
  • Out of disk space issues
  • CVS checkout failures (yes, we still use CVS....)

If there's other things people can think of that should be flagged as infra problems, or that should cause builds to be retried, please add them to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors. Bonus points if you write the regular expression that catches it :-).
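
To give an idea of what these patterns look like, here's a purely illustrative one for a couple of the Mercurial failure modes (the messages matched are examples, not the actual patterns we use):

import re

# Illustrative only: failure output that should turn a build purple and
# trigger an automatic retry, rather than turning it red.
HG_INFRA_ERROR = re.compile(
    r"abort: (?:error: Connection (?:refused|reset by peer)"
    r"|HTTP Error 5\d\d)")

line = "abort: HTTP Error 500: Internal Server Error"
if HG_INFRA_ERROR.search(line):
    print("infrastructure problem: retry this build")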

Currently, the purpleness is only visible on plain Tinderbox, but once bug 592340 is resolved, TBPL will support it as well.

More than 3 months worth of machine time has been saved by *you*

Ever since bug 541364 landed 4 months ago it's been possible to selectively disable platforms on try by overriding specific mozconfigs. Since that time, roughly 2321 hours (that's 96 days, or 13 weeks, or ~3 months) of machine time have been saved through this - and that calculation only counts compile time; even more has been saved on the test side. I just want to say a huuuuuuge THANKS! on behalf of RelEng. Taking the time to disable unneeded things on a push makes a noticeable difference in the time it takes us to turn around a full set of tests, especially during busier times.

More stats:

  • The most common platforms disabled were Android (120 times), Mac 64-bit (115 times), and Maemo (Maemo 4: 115 times, Maemo 5 GTK: 106, Maemo 5 Qt: 105)
  • The least common platforms disabled were Windows (58 times), Linux 32-bit (85 times), and Mac 32-bit (96 times)