Using Authenticode Code Signing Certificates with OS X’s Signing Tools

Our intern Erick has been doing some great work reviving, polishing and finalizing the patches that will allow us to start signing OS X builds (more to come on that in his blog!). When we do start signing them we’re planning to use our existing set of code signing certificates rather than buy new ones. I thought it would be a simple task to convert them, so I set off to convert our internal, self-generated ones. After hours and hours of head scratching and frustration I learned that some versions of Microsoft’s “makecert” tool are broken, and generate invalid PKCS7 certs that openssl can’t cope with. From the OpenSSL PKCS#12 FAQ:

Q. What are SPC files?
A. They are simply DER encoded PKCS#7 files containing the certificates. Well they are in the newer versions of the tools. The older versions used an invalid PKCS#7 format.

The end result of all my attempts ended up being a PKCS#12 certificate that Apple’s codesign tool claimed couldn’t be used to do code signing.

After finding that FAQ, I decided to try to convert our Nightly code signing certificate instead. Following the great instructions found on Marc Liyanage’s blog I managed to successfully convert the certificate, import it into a Keychain, and successfully sign something! Here’s the shortened version of what I did. Note that it requires the PVK tool found here:

~/pvk.exe -in Nightly.pvk -out Nightly.key.pem
openssl pkcs7 -inform der -print_certs < Nightly.spc > Nightly.cert.pem
openssl pkcs12 -export -inkey Nightly.key.pem -in Nightly.cert.pem -out Nightly.p12

I hope this helps someone else avoid the same frustration!

Release Automation – Part 3: Improvements & Optimizations (2009 to early 2011)

In my last post I talked about the major project of switching the Release Automation from Bootstrap driven by Buildbot to being directly implemented in Buildbot, and working out of Mercurial. After a 6 month break from automation work, there were a few spurts of development on the automation over the course of the next two years. Some of these were big new things, like Fennec automation, while others were deliberate attempts to improve the automation. This post will cover the most important changes that happened from late 2009 all the way through early 2011.

Late 2009 to mid 2010

Fennec Release Automation

In 2009 Mozilla began working on a version of Firefox for Maemo. Late in that year, we shipped 1.0rc1 with the Release Automation. Some people may be thinking “that doesn’t sound very hard, it’s just another platform, right?”. Unfortunately, there’s a lot of hidden complexity in adding a new platform that doesn’t conform to long-held assumptions, like mobile. While the actual build process is fairly similar, there’s a lot of pre-build, post-build, and other work that just isn’t the same. Fennec was the first product we supported that was built out of multiple source repositories, which not only caused problems for builds (and isn’t handled well by Buildbot), but affected how we tag repositories and generate source tarballs. L10n repacks were also completely different for Fennec: not only did we ship individual l10n builds for many locales, but we also shipped builds with multiple locales in them. Doing this meant build process changes as well as a new format to describe locales, revisions, and which types of repacks each one needed. All of this combined ended up being nearly a month of work (and many late nights, Aki tells me) to get up and running! This was the first product we ever shipped that had automated releases from the start, which is a huge accomplishment for forward thinking & planning – something that we simply didn’t have time for in the past. It’s hard to say how much end2end time and how many manual touchpoints this saved, since it was never manual work to begin with, but there’s no doubt that we’re far better off with it than without.

Major Update

In the latter half of 2009 we started doing a lot of Major Updates. That is, offering 3.0.x users an update to a 3.5.x release. Behind the scenes, each Major Update offer had an end2end time of approximately 4 hours and at least 6 or 7 manual touchpoints for config file bumping, snippet generation, test snippet pushing, and verification of those snippets. If that wasn’t bad enough, a single mistake in the configuration file would cause us to have to restart the entire process! Automating this turned out to be one of the easier pieces of new automation because of how similar Major Updates were to the regular updates we already did with every release. When this relatively simple work was done, all of the manual touchpoints were gone completely, and because these updates were now generated automatically with a release instead of out of band, they moved out of the critical path and therefore had no end2end time impact either! This is always the best kind of new automation =).

Bouncer Entries

In mid-2010 we automated a long standing annoyance: Bouncer entry creation. Like the Major Updates, this was something that was subject to manual error. More importantly, it was _damn_ annoying to do. Bouncer is the piece of software that powers download.mozilla.org, which redirects download requests to our mirror network. Each time we release we need to tell it where to find the files we ship. This translates to one entry for each installer, complete MAR, and partial MAR for each platform. Prior to this fix, entries were created mostly through copy and paste, which left a massive margin for error. In the best case scenario this means we’ll get some 404s, which are easy to detect and fix. In the worst case we could point at the wrong release entirely, which is an error that may not get caught at all. Fixing this didn’t improve our end2end time at all but it did take away the most annoying manual touchpoint, which we were all very happy about.
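The entry set is mechanical enough to enumerate programmatically, which is what makes it such a good automation target. A hedged Python sketch of the idea (the product names, platform list, and path patterns here are invented for illustration, not Bouncer’s actual schema):

```python
# Sketch: enumerate the Bouncer entries needed for one release.
# One installer + one complete MAR + one partial MAR per platform.
# Platform names and filename patterns are illustrative only.
PLATFORMS = ["win32", "mac", "linux-i686"]

def bouncer_entries(version, prev_version):
    entries = []
    for platform in PLATFORMS:
        entries.append(("%s-installer" % platform,
                        "firefox-%s.%s.installer" % (version, platform)))
        entries.append(("%s-complete" % platform,
                        "firefox-%s.%s.complete.mar" % (version, platform)))
        entries.append(("%s-partial" % platform,
                        "firefox-%s-%s.%s.partial.mar" % (prev_version, version, platform)))
    return entries

entries = bouncer_entries("3.6.4", "3.6.3")
```

Generating the list from one version string, instead of copy-pasting each entry, is precisely what removes the “point at the wrong release” class of error.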

After this change the automation stayed relatively stable for the next 6 months, with only minor bugfixes happening.

Late 2010 to early 2011

At the end of 2010 and start of 2011 we began a huge round of upgrades and optimizations, starting with upgrading to a new version of Buildbot. This work wasn’t shiny or fun, but it was long overdue after the regular Continuous Integration infrastructure had upgraded many months prior.

After that was done, some of us spent the next couple of months working hard on new automation & improvements. This was one of the most exciting and active times for the Release Automation. We lowered end2end time by parallelizing some things, we took away many manual touchpoints with new pieces of automation, and we dramatically improved stability through intelligent retrying of failed operations. Also of note is that we went back to a model of standalone scripts doing the work, with Buildbot driving them, not unlike the Buildbot+Bootstrap era. This came about after a lot of challenges implementing things directly in Buildbot code, which makes it very difficult to make decisions at runtime, and a feeling that we didn’t want to tie ourselves to Buildbot forever.

Source Code Tagging

At the time, source repository tagging was one of the rougher parts of the automation. Not only did it often fail to push tags back to a repository due to losing a push race, but load issues also caused server-side errors. For a period of time it was rare that a release *didn’t* have a problem in tagging. Moving the tagging code to an external script made fixing these errors a lot easier. At the same time, we were able to start building up some very useful libraries for working with Mercurial, retrying failed commands, and other things. Since these changes landed it’s been very rare to have issues with tagging, and most of the problems we do see have been regressions from recently landed work rather than long standing bugs with the tagging scripts.
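The retry idiom at the heart of those libraries is simple. A minimal Python sketch of the concept (this is not the actual library code, and the commented `hg push` usage is illustrative):

```python
import subprocess
import time

def retry(func, attempts=5, sleep=10, retry_exceptions=(Exception,)):
    """Call func(), retrying up to `attempts` times on failure.

    A sketch of the retry idiom, not Mozilla's actual helper."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_exceptions:
            if attempt == attempts:
                # Out of attempts: let the final failure propagate.
                raise
            time.sleep(sleep)

# e.g. retrying a tag push that may lose a push race:
# retry(lambda: subprocess.check_call(["hg", "push", repo]),
#       retry_exceptions=(subprocess.CalledProcessError,))
```

Wrapping every flaky operation (pushes, clones, uploads) in one well-tested helper like this is what turned intermittent infrastructure hiccups from release-blocking failures into invisible retries.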

L10n Repacks

We used to have similar issues with our l10n repacking logic, too. Sometimes the jobs would die while trying to clone a repository or when trying to download or upload a build. Additionally, we used to use a different Buildbot job for each locale, which meant that we would redo steps like “clone/pull from source repository” for every single locale which was quite inefficient. As you may have guessed, we did a similar thing to fix these issues: moved them to a script! Because of the earlier work done with tagging we were able to get retrying of repository cloning for free, and easily add retrying of uploads/downloads. This script also introduced another new technique to the Release Automation: chunking (which was shamelessly ripped off of the Mochitest harness). Rather than have 1 Buildbot job for every single locale, the script knows how to compute the overall set of work for all locales and pick a chunk of it to work on.
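The chunking arithmetic itself is tiny. A hedged Python sketch of the idiom (locale names are examples, and the real repack script differs; this just shows how a chunk of the overall work can be computed deterministically):

```python
def get_chunk(items, total_chunks, this_chunk):
    """Return the `this_chunk`-th (1-based) of `total_chunks` slices of items.

    Sketch of the chunking idiom; the real repack script differs."""
    per_chunk = len(items) // total_chunks
    remainder = len(items) % total_chunks
    # Earlier chunks absorb the remainder, one extra item each.
    start = per_chunk * (this_chunk - 1) + min(this_chunk - 1, remainder)
    end = start + per_chunk + (1 if this_chunk <= remainder else 0)
    return items[start:end]

locales = ["de", "es-ES", "fr", "it", "ja", "pl", "ru"]
# Three Buildbot jobs each repack one chunk instead of one job per locale:
chunk_one = get_chunk(locales, 3, 1)
```

Because each job computes its own slice from the full list, there is no coordination needed between jobs, and the per-locale setup cost (cloning, pulling) is paid once per chunk instead of once per locale.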

Automated E-mail

Every release requires a lot of coordination, particularly with Release Drivers and QA. We need to send mail notifications when the Release Automation starts, when each platform’s en-US build is complete, when each platform’s l10n repacks are complete, when updates are ready for testing, and some other events, too. It used to be that the Release Engineer responsible for the release would actively watch the jobs on a Buildbot display and send mail by hand as the jobs completed. Especially as we started doing releases more often, this became extremely tedious and distracting. It also caused artificial delays of up to 8 hours (in the worst case)! By automating these mails we massively reduced manual touchpoints, became more consistent with the messages we sent, allowed Release Engineers to more easily do other work mid-release, and in some extreme cases reduced end2end time of a release by multiple hours. Looking back on it this was one of the most important changes we’ve ever made, and certainly had the best cost/benefit ratio.
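Conceptually, each notification is just a templated mail fired when a release event completes. A hedged Python sketch (the addresses, subject format, and SMTP host are invented; the real notifications were wired into the automation’s job-completion hooks):

```python
import smtplib
from email.mime.text import MIMEText

def build_message(event, release, recipients):
    """Build a status mail for a release event (sketch; wording invented)."""
    msg = MIMEText("Release automation status for %s: %s" % (release, event))
    msg["Subject"] = "[release] %s: %s" % (release, event)
    msg["From"] = "release-automation@example.com"
    msg["To"] = ", ".join(recipients)
    return msg

def notify(event, release, recipients, smtp_host="localhost"):
    """Send the mail; the SMTP host is a placeholder."""
    msg = build_message(event, release, recipients)
    server = smtplib.SMTP(smtp_host)
    try:
        server.sendmail(msg["From"], recipients, msg.as_string())
    finally:
        server.quit()
```

Firing something like this automatically at each milestone is what removed the artificial delays: the mail goes out the moment the job finishes, not the moment a human notices it finished.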

Pushing to Mirrors et al.

When we push a Firefox release out to the mirror network we get past the point of no return. Once it’s out there, we have no way to pull it back and no way to guarantee that we overwrite all of the files on all of the mirrors in a timely manner. If we find bugs past that point we have to increment the version number and start again. Because of that we do a full antivirus check and verification of all permissions prior to pushing (in addition to all of the testing that QA already does). These used to all be done by hand – a Release Engineer would log onto a machine at some point between builds being available and prior to pushing, run some commands, and wait. Besides the annoyance of doing it by hand, we would sometimes forget to do this in advance of the release. When that happened, these checks were suddenly in the critical path, holding up the release. To address both of those issues these checks were automated and run immediately after all release files became available. At the same time we partly automated the mirror push itself. Pushing to mirrors involves running a command like:
rsync -av --exclude=*tests* --exclude=*crashreporter* --exclude=*.log --exclude=*.txt --exclude=*unsigned* --exclude=*update-backup* --exclude=*partner-repacks* --exclude=*.checksums --exclude=logs --exclude=jsshell* --exclude=*/*.asc /pub/mozilla.org/firefox/nightly/10.0.2-candidates/build1/ /pub/mozilla.org/firefox/releases/10.0.2/
With such a non-trivial command being required every time, it’s easy to make a mistake, so once again, automating is a clear way to reduce manual error.

Autosign

All of the Firefox builds that we distribute are signed in some manner. On Windows, we have Authenticode signatures; for everything else we have detached GPG signatures. Signing our builds is a crucial part of the release process and right in the middle of the critical path. Because we ship Firefox in so many languages and on multiple platforms it can take a while to do all of our signing, which means it’s important to get it started as soon as possible. In the past, we had to wait for all builds & repacks to complete and then run a long series of manual commands on our signing machine to: download the builds, sign them, verify them, and upload the signed bits. This was OK for a while, but as we started shipping in more languages on more platforms it became horribly inefficient; downloading the builds alone started to take 30 minutes or more. And again, like many other things, there was lots of opportunity for manual error. Enter: Autosign. This relatively simple improvement adjusted the existing signing logic to be able to detect when it had all of the required bits to start signing. This meant that we could run the commands that would start signing as soon as the release began. The scripts continually download builds in a loop, in parallel with the rest of the automation running, which means we completely remove the “download builds” part of the signing process from the critical path. This also means that the Release Engineer doesn’t need to be at work or even awake when all of the builds & repacks complete. In some cases, just like automated e-mail, this can save multiple hours of end2end time.

Summary

The combination of all of the changes above took the automation from a moderately fast system that worked most of the time to a very speedy system that rarely fails. Nearly everyone in Release Engineering had a hand in these changes, and most of them were made over a two-month period!

Incredibly, there was still more we found to improve in the following year, which I’ll talk about in Part 4!

Release Automation – Part 2: Mercurial-based, v1

Around the start of 2008 Mozilla moved Firefox and Gecko development from CVS to Mercurial, with Firefox 3.5 (nee 3.1) as the first release out of the new repository. In addition to that, the underlying build infrastructure had switched from being Tinderbox driven to being Buildbot driven – which made some of the existing release automation useless. In mid-2008 we started planning to port, rework, and update the release automation for this new environment. The 2008 Firefox Summit conveniently happened right around this time, so we took that opportunity to gather a quorum on the subject and go over all the plans in detail. By the end of the night (and end of the beer, if I recall correctly), we had discussed everything to death and filed a tracking bug.

This version of the automation struck a balance between improving the overall design of the system and simply doing straight porting work. The plain porting isn’t very interesting, so I’ll be mostly focusing on the improvements we made in this post.

One of the bigger optimizations we made was to generate files in their final location at build time. In the Bootstrap days we uploaded files to flat directories with long filenames, and then re-arranged them into their final layout later on in the process. With this change, our candidates directories looked a lot more like the associated release directory. This may not sound like a huge change but it cut our disk space usage per release in half or more, shaved over an hour off the end2end time of the release, and let us put our release file naming logic into the build system, where it more rightly belonged. It also allowed us to make the next optimization: combining the signing processes.

In the Bootstrap and pre-Bootstrap worlds we had two separate signing processes: one to sign the internal guts of Firefox win32 builds (firefox.exe, xul.dll, et al.) and one to sign the Firefox installers themselves. Early on, we signed the internal bits and handed them off to QA. Closer to release time, we signed the installers themselves and generated GPG signatures for all files. The only reason I can think of why we would do this is to keep signed installers out of public directories until we’re sure we’ll be releasing them. This isn’t without its drawbacks though. Leaving this until later in the process added unnecessary manual touchpoints, put non-trivial work late in the critical path, and worst of all, it meant QA did not test the exact bits that we shipped to users! (We actually managed to ship unsigned installers once, which isn’t possible anymore.) Improving this only required a small rework of our existing signing scripts (and lots of testing, of course!) but it took another 1-2h off of our end2end time and removed another manual touchpoint.

It’s also worth noting that merely by switching to Mercurial we saved over half an hour of end2end time in tagging. In CVS, we had to create a branch and tag thousands and thousands of files with multiple tags, which takes a very long time. In Mercurial, we have to clone a repository, which takes some time, but the tagging itself is near-instant.

In addition to the optimizations noted above, tons of work was done porting the existing automation. Many things had to be pulled out of Bootstrap and put into their own scripts to make them usable by both versions of the automation; en-US builds and l10n repacks had to be reimplemented entirely in Buildbot; and some other things that couldn’t be pulled out of Bootstrap had to be reimplemented as well. It was a very large undertaking that was primarily worked on by Nick Thomas, Coop, and myself and took months to complete.

Firefox 3.1b3 was the first fully automated release with this automation. By the time we worked out most of the kinks we were at an end2end time of 8-10h and about 12 manual touchpoints.

Next up: Various improvements & optimizations (not as boring as it sounds, I promise!)

Release Automation – Part 1: Bootstrap

One of the first tasks I had as a full-time employee of Mozilla was getting the Bootstrap Release framework working with Firefox 3.0 Beta releases. Now, just over 4 years later, our Release Automation has changed dramatically in many ways: primary language, supported platforms, scope and extent, reliability, and versatility. I thought it might be interesting to trace the path from there to here, and talk about what’s in store for the future, too. Throughout all of this work there have been two overarching goals: 1) Lower the time it takes to go from “go to build” to “updates available for testing” – which we call “end2end time”, and 2) Reduce the number of machines we have to log into, commands we have to run, and active time we have to spend on a release – known as “manual touchpoints”. I’ll be referencing these a lot throughout this series.

This post will talk about what I know of Bootstrap and my work porting it to Firefox 3.0.

In its earliest form Bootstrap was a simple scripted version of much of the previously manual release process. The processes for tagging VCS repositories, creating deliverables (source packages, en-US and localized builds, updates), and some verifications were encapsulated into its scripts. This was a big improvement over the 100% manual, cut+paste-from-a-wiki, process. Instead of logging into many machines and running many commands, the release engineer had to log in to many machines and run a few, very simple commands. The very first release that was Bootstrap-aided was Firefox 1.5.0.9, built on December 6th, 2006. This was before my time, but a former release engineer, Rob Helmer, told me that the end2end time back then could be multiple days, and countless touchpoints.

Over time, more parts of the release process were automated with Bootstrap, further reducing the burden on the release engineer. Even with these big improvements some classes of things were still not codified: which machines to run which commands on, when and in what order to run things, who to notify about what. Enter: Buildbot. Integrating Bootstrap into Buildbot was the next logical step in the process. It would handle scheduling and status, while Bootstrap would remain responsible for all of the implementation. With this, the release engineer only had to log in to a few machines and run a few, very simple commands. Another big improvement! The first release to benefit from this was Firefox 2.0.0.8, built on October 10th, 2007. This work was largely done by Rob Helmer.

Around this time we were gearing up to start shipping the first Firefox 3.0 Beta release and had never tested Bootstrap against that development branch. I was tasked with making whatever changes were necessary to Bootstrap and our Buildbot setup to make it work. The Buildbot side was largely simple because it sits at such a high abstraction layer, but back in those days we still had single-purpose Buildbot masters, so it involved adding several hundred lines of config code.

The Bootstrap side was far more interesting. Until this point, there were a lot of built-in assumptions based on what the 1.8 branch looked like, including:

  • Releases are done from CVS branches (explicitly _not_ trunk)
  • Windows build machines run Cygwin
  • Linux packages are in .gz format
  • The crash reporting system Talkback is always shipped

By themselves, none of these things were too challenging to deal with, but as a very new hire it took me about a month to find solutions to the combination and fully test them, with many rounds of feedback and guidance along the way. With all of that done and landed, we managed to use the new automation to build Firefox 3.0b2 on December 10, 2007. At this point, the end2end time was around 24h and there were about 20 manual touchpoints.

Over the next 8 months or so there were a few major improvements of note. Firstly, Nick Thomas fixed bug 409394 (Support for long version names), which allowed us to start shipping releases with nicer looking filenames like “Firefox Setup 3.0 Beta 4”. Not a crucial thing, but much nicer from the user perspective. Bug 422235 (enable fast patcher for release automation) was a massive improvement in update generation, written by schrep. With this work, we went from taking 6-8 hours to generate updates down to ~1h — an incredible savings in time. Finally, bug 428063 (Support major releases & quit using rc in overloaded ways) (also fixed by Nick) enabled us to build RCs with Bootstrap. While it may sound simple, there are a lot of things in release automation that depend on filenames, and catching them all can be difficult. As well as making it possible to build these, this bug also renamed the internal “rc” notion to “build”, to avoid situations where we’d have things like “3.0 RC1 rc1”, which was utterly confusing.

So, in the early days there were tons of improvements made quickly: Bootstrap itself sped things up and lowered the possibility of error by reducing manual touchpoints. Buildbot + Bootstrap did so again, through the same methods. We also had pure speed-ups through things such as fast patcher. Having these things allowed us to maintain the 2.0.0.x and 3.0.x branches much more easily, and get chemspill releases out quickly and simultaneously. All of this work had to be done incrementally too, because we had to continue shipping releases while the work was happening. It’s hard to find good data for releases done with this version of the automation, but I guesstimate that the end2end time was around 12-14 hours and the number of manual touchpoints was still around 20 for a release without major issues.

Next up….Release Automation on Mercurial, v1.