Signing Software at Scale

Mozilla produces a lot of builds. We build Firefox for somewhere between 5 to 10 platforms (depending how you count). We release Nightly and Aurora every single day, Beta twice a week, and Release and ESR every 6 weeks (at least). Each release contains an en-US build and nearly a hundred localized repacks. In the past the only builds we signed were Betas (which were once a week at the time), Releases, and ESRs. We had a pretty well established manual for it, but due to being manual it was still error prone and impractical to use for Nightly and Aurora. Signing of Nightly and Aurora became an important issue when background updates were implemented because one of the new security requirements with background updates was signed installers and MARs.

Enter: Signing Server

At this point it was clear that the only practical way to sign all the builds that we need to is to automate it. It sounded crazy to me at first. How can you automate something that depends on secret keys, passphrases, and very unfriendly tools? Well, there's some tricks you need to know, and throughout the development and improvement of our "signing server", we've learned a lot. In the post I'll talk about those tricks and show you how can use them (or even our entire signing server!) to make your signing process faster and easier.

Credit where credit is due: Chris AtLee wrote the core of the signing server and support for some of the signature types. Over time Erick Dransch, Justin Wood, Dustin Mitchell, and I have made some improvements and added support for additional types of signatures.


Tip #1: Collect passphrases at startup

This should be obvious to most, but it's very important not to store the passphrases to your private keys unencrypted. However, because they're needed to unlock the private keys when doing any signing the server needs to have access to them somehow. We've dealt with this by asking for them when launching a signing server instance:

$ bin/python tools/release/signing/signing-server.py signing.ini
gpg passphrase: 
signcode passphrase: 
mar passphrase: 

Because instances are started manually by someone in the small set of people with access to passphrases we're able to ensure that keys are never left unencrypted at rest.

Tip #2: Don't let just any machine request signed files

One of the first problems you run into when you have an API for signing files is how to make sure you don't accidentally sign malicious files. We've dealt with this in a few ways:

  • You need a special token in order to request any type of signing. These tokens are time limited and only a small subset of segregated machines may request them (on behalf of the build machines). Since build jobs can only be created if you're able to push to hg.mozilla.org, random people are unable to submit anything for signing.
  • Only our build machines are allowed to make signing requests. Even if you managed to get hold of a valid signing token, you wouldn't be able to do anything with it without also having access to a build machine. This is a layer of security that helps us protect against a situation where an evil doer may gain access to a loaner machine or other less restricted part of our infrastructure.

We have other layers of security built in too (HTTPS, firewalls, access control, etc.), but these are the key ones built into the signing server itself.

Tip #3: Use input redirection and other tricks to work around unfriendly command line tools

One of the trickiest parts about automating signing is getting all the necessary command line tools to accept input that's not coming from a console. Some of them are relative easy and accept passphrases via stdin:

proc = Popen(command, stdout=stdout, stderr=STDOUT, stdin=PIPE)
proc.stdin.write(passphrase)
proc.stdin.close()

Others, like OpenSSL, are fussier and require the use of pexpect:

proc = pexpect.spawn("openssl", args)
proc.logfile_read = stdout
proc.expect('Enter pass phrase')
proc.sendline(passphrase)

And it's no surprise at all that OS X is the fussiest of them all. In order to sign you have to unlock the keychain by hand, run the signing command, and relock the keychain yourself:

child = pexpect.spawn("security unlock-keychain" + keychain)
child.expect('password to unlock .*')
child.sendline(passphrase)
check_call(sign_command + [f], cwd=dir_, stdout=stdout, stderr=STDOUT)
check_call(["security", "lock-keychain", keychain])

Although the code is simple in the end, a lot of trial, error, and frustration was necessary to arrive at it.

Tip #4: Sign everything you can on Linux (including Windows binaries!)

As fussy as automating tools like openssl can be on Linux, it pales in comparison to trying to automate anything on Windows. In the days before the signing server we had a scripted signing method that ran on Windows. Instead of providing the passphrase directly to the signing tool, it had to typed into a modal window. It was "automated" with an AutoIt script that typed in the password whenever the window popped up. This was hacky, and sometimes lead to issues if someone moved the mouse or pressed a key at the wrong time and changed window focus.

Thankfully there's tools available for Linux that are capable of signing Windows binaries. We started off by using Mono's signcode - a more or less drop in replacement for Microsoft's:

$ signcode -spc MozAuthenticode.spc -v MozAuthenticode.pvk -t http://timestamp.verisign.com/scripts/timestamp.dll -i http://www.mozilla.com -a sha1 -tr 5 -tw 60 /tmp/test.exe
Mono SignCode - version 2.4.3.1
Sign assemblies and PE files using Authenticode(tm).
Copyright 2002, 2003 Motus Technologies. Copyright 2004-2008 Novell. BSD licensed.

Enter password for MozAuthenticode.pvk: 
Success

This works great for 32-bit binaries - we've been shipping binaries signed with it for years. For some reason that we haven't figured out though, it doesn't sign 64-bit binaries properly. For those we're using "osslsigncode", which is an OpenSSL based tool to do Authenticode signing:

$ osslsigncode -certs MozAuthenticode.spc -key MozAuthenticode.pvk -i http://www.mozilla.com -h sha1 -in /tmp/test64.exe -out /tmp/test64-signed.exe
Enter PEM pass phrase:
Succeeded

$ osslsigncode verify /tmp/test64-signed.exe 
Signature verification: ok

Number of signers: 1
    Signer #0:
        Subject: /C=US/ST=CA/L=Mountain View/O=Mozilla Corporation/CN=Mozilla Corporation
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1

Number of certificates: 3
    Cert #0:
        Subject: /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
    Cert #1:
        Subject: /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
    Cert #2:
        Subject: /C=US/ST=CA/L=Mountain View/O=Mozilla Corporation/CN=Mozilla Corporation
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1

In addition to Authenticode signing we also do GPG, APK, and couple of Mozilla-specific types of signing (MAR, EME Voucher) on Linux. We also sign our Mac builds with the signing server. Unfortunately, the tools needed for that are only available on OS X, so we have to run separate signing servers for these.

Tip #5: Run multiple signing servers

Nobody likes a single point of failure, so we've built support our signing client to retry against multiple instances. Even if we lose part of our signing server pool, our infrastructure stays up:
$ python signtool.py --cachedir cache -t token -n nonce -c host.cert -H dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9120 --formats dmgv2 Firefox.app
2015-01-23 06:17:59,112 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing3.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:17:59,118 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: connection error; trying again soon
2015-01-23 06:18:00,119 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:18:00,141 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: uploading for signing
2015-01-23 06:18:10,748 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:19:11,848 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:19:40,480 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: OK

Running your own signing server

It's easy! All of the code you need to run your own signing server is in our tools repository. You'll need to set-up a virtualenv and create your own config file, but once you're ready you can attempt to start it with the following command:

python signing-server.py signing.ini

You'll be prompted for the passphrases to your private keys. If there's any problems with your config file or the passphrases the server will fail to start. Once you've got it up and running you can use try signing! get_token.py has an example of how to generate a signing token, and signtool.py will take your unsigned files and give you back signed versions. Happy signing!

UPDATED: New update server is going live for release channel users on Tuesday, January **20th**

(This post has been updated with the new go-live date.)

Our new update server software (codenamed Balrog) has been in development for quite awhile now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we're ready to switch the vast majority of our users over. We'll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

Stick around if you're interested in some of the load testing we did.


Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn't a complete surprise given that we hadn't implemented any form of caching yet. After implementing a simple LRU cache on Balrog's largest objects, we did another load test. Here's what the load looked like on one web head:

Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

I'm not sure what to call that other than winning.

Redo 1.3 is released - now with more natural syntax!

We've been using the functions packaged in Redo for a few years now at Mozilla. One of the things we've been striving for with it is the ability to write the most natural code possible. In it's simplest form, retry, a callable that may raise, the exceptions to retry on, and the callable to run to cleanup before another attempt - are all passed in as arguments. As a result, we have a number of code blocks like this, which don't feel very Pythonic:

retry(self.session.request, sleeptime=5, max_sleeptime=15,

  retry_exceptions=(requests.HTTPError,

                    requests.ConnectionError),

  attempts=self.retries,

  kwargs=dict(method=method, url=url, data=data,

              config=self.config, timeout=self.timeout,

              auth=self.auth, params=params)

)

It's particularly unfortunate that you're forced to let retry do your exception handling and cleanup - I find that it makes the code a lot less readable. It's also not possible to do anything in a finally block, unless you wrap the retry in one.

Recently, Chris AtLee discovered a new method of doing retries that results in much cleaner and more readable code. With it, the above block can be rewritten as:


for attempt in retrier(attempts=self.retries):

    try:

        self.session.request(method=method, url=url, data=data,

                             config=self.config,

                             timeout=self.timeout, auth=self.auth,

                             params=params)

        break

    except (requests.HTTPError, requests.ConnectionError), e:

        pass

retrier simply handles the the mechanics of tracking attempts and sleeping, leaving your code to do all of its own exception handling and cleanup - just as if you weren't retrying at all. It's important to note that the break at the end of the try block is important, otherwise self.session.request would run even if it succeeded.

I released Redo 1.3 with this new functionality this morning - enjoy!

Stop stripping (OS X builds), it leaves you vulnerable

While investigating some strange update requests on our new update server, I discovered that we have thousands of update requests from Beta users on OS X that aren't getting an update, but should. After some digging I realized that most, if not all of these are coming from users who have installed one of our official Beta builds and subsequently stripped out the architecture they do not need from it. In turn, this causes our builds to report in such a way that we don't know how to serve updates for them.

We'll look at ways of addressing this, but the bottom line is that if you want to be secure: Stop stripping Firefox binaries!

New update server has been rolled out to Firefox/Thunderbird Beta users

Yesterday marked a big milestone for the Balrog project when we made it live for Firefox and Thunderbird Beta users. Those with a good long term memory may recall that we switched Nightly and Aurora users over almost a year ago. Since then, we've been working on and off to get Balrog ready to serve Beta updates, which are quite a bit more complex than our Nightly ones. Earlier this week we finally got the last blocker closed and we flipped it live yesterday morning, pacific time. We have significantly (~10x) more Beta users than Nightly+Aurora, so it's no surprise that we immediately saw a spike in traffic and load, but our systems stood up to it well. If you're into this sort of thing, here are some graphs with spikey lines:

The load average on 1 (of 4) backend nodes:

The rate of requests to 1 backend node (requests/second):

Database operations (operations/second):

And network traffic to the database (MB/sec):

Despite hitting a few new edge cases (mostly around better error handling), the deployment went very smoothly - it took less than 15 minutes to be confident that everything was working fine.

While Nick and I are the primary developers of Balrog, we couldn't have gotten to this point without the help of many others. Big thanks to Chris and Sheeri for making the IT infrastructure so solid, to Anthony, Tracy, and Henrik for all the testing they did, and to Rail, Massimo, Chris, and Aki for the patches and reviews they contributed to Balrog itself. With this big milestone accomplished we're significantly closer to Balrog being ready for Release and ESR users, and retiring the old AUS2/3 servers.

Upcoming changes to Mac package layout, signing

Apple recently announced changes to how OS X applications must be packaged and signed in order for them to function correctly on OS X 10.9.5 and 10.10. The tl;dr version of this is "only mach-O binaries may live in .app/Contents/MacOS, and signing must be done on 10.9 or later". Without any changes, future versions of Firefox will cease to function out-of-the-box on OS X 10.9.5 and 10.10. We do not have a release date for either of these OS X versions yet.

Changes required:

  • Move all non-mach-O files out of .app/Contents/MacOS. Most of these will move to .app/Contents/Resources, but files that could legitimately change at runtime (eg: everything in defaults/) will move to .app/MozResources (which can be modified without breaking the signature): https://bugzilla.mozilla.org/showdependencytree.cgi?id=1046906&hide_resolved=1. This work is in progress, but no patches are ready yet.

  • Add new features to the client side update code to allow partner repacks to continue to work. (https://bugzilla.mozilla.org/show_bug.cgi?id=1048921)

  • Create and use 10.9 signing servers for these new-style apps. We still need to use our existing 10.6 signing servers for any builds without these changes. (https://bugzilla.mozilla.org/show_bug.cgi?id=1046749 and https://bugzilla.mozilla.org/show_bug.cgi?id=1049595)

  • Update signing server code to support new v2 signatures.

Timeline:

We are intending to ship the required changes with Gecko 34, which ships on November 25th, 2014. The changes required are very invasive, and we don't feel that they can be safely backported to any earlier version quickly enough without major risk of regressions. We are still looking at whether or not we'll backport to ESR 31. To this end, we've asked that Apple whitelist Firefox and Thunderbird versions that will not have the necessary changes in them. We're still working with them to confirm whether or not this can happen.

This has been cross posted a few places - please send all follow-ups to the mozilla.dev.platform newsgroup.

June 17th Nightly/Aurora updates of Firefox, Fennec, and Thunderbird will be slightly delayed

As part of the ongoing work to move our Betas and Release builds to our new update server, I'll be landing a fairly invasive change to it today. Because it requires a new schema for its data updates will be slightly delayed while the data repopulates in the new format as the nightlies stream in. While that's happening, updates will continue to point at the builds from today (June 16th).

Once bug 1026070 is fixed, we will be able to do these sort of upgrades without any delay to users.

How to not get spammed by Bugzilla

Bugmail is a running joke at Mozilla. Nearly everyone I know that works with Bugzilla (especially engineers) complains about the amount of bugmail they get. I too suffered from this problem for years, but with some tweaks to preferences and workflow, this problem can be solved. Here's how I do it:

E-mail preferences

  • Disable e-mail completely for cc changes and other things that don't generally matter to you. For me, this includes the keyword field and even the dependency tree.
  • If you follow components, make sure you only get mail for NEW bugs in that component. You can cc yourself explicit to things that you decide to care about. This one has been huge for me. I follow 5 components, and I would get hundreds of additional mail per day if I got mail about every change to them.
  • Set "Automatically add me to the CC list of bugs I am requested to review" and "Automatically add me to the CC list of bugs I change" to "never". See the workflow section below for more on this.
  • Set-up an e-mail filter to automatically mark your own changes as "read". I like to get mail for this for better searchability, but there's no reason it should be something I need to look at when it comes in. You can do this by matching against the "X-Bugzilla-Who" header.
  • If you filed a bug you no longer care about, Mozilla's Bugzilla now has an "Ignore Bug Mail" field that will make it stop mailing you about it.

Here's what my full e-mail settings look like:

And here's my Zimbra filter for changes made by me (I think the "from" header part is probably unnecessary, though):

Workflow

This section is mostly just an advertisement for the "My Dashboard" feature on Mozilla's Bugzilla. By default, it shows you your assigned bugs, requested flags, and flags requested of you. Look at it at regular intervals (I try to restrict myself to once in the morning, and once before my EOD), particularly the "flags requested of you" section.

The other important thing is to generally stop caring about a bug unless it's either assigned to you, or there's a flag requested of you specifically. This ties in to some of the e-mail pref changes above. Changing my default state from "I must keep track of all bugs I might care about" to "I will keep track of my bugs & my requests, and opt-in to keeping tracking of anything else" is a shift in mindset, but a game changer when it comes to the amount of e-mail (and cognitive load) that Bugzilla generates.

With these changes it takes me less than 15 minutes to go through my bugmail every morning (even on Mondays). I can even ignore it at times, because "My Dashboard" will make sure I don't miss anything critical. Big thanks to the Bugzilla devs who made some of these new things possible, particularly glob and dkl. Glob also mentioned that even more filtering possibilities are being made possible by bug 990980. The preview he sent me looks infinitely customizable:

This week in Mozilla RelEng – June 13th, 2014 - *double edition*

I spaced and forgot to post this last week, so here's a double edition covering everything so far this month. I'll also be away for the next 3 Fridays, and Kim volunteered to take the reigns in my stead. Now, on with it!

Major highlights:

Completed work (resolution is 'FIXED'):

In progress work (unresolved and not assigned to nobody):

More on "How far we've come"

After I posted "How far we've come" this morning a few people expressed interest in what our release process looked like before, and what it looks like now.

The earliest recorded release process I know of was called the "Unified Release Process". (I presume "unified" comes from unifying the ways different release engineers did things.) As you can see, it's a very lengthy document, with lots of shell commands to tweak/copy/paste. A lot of the things that get run are actually scripts that wrap some parts of the process - so it's not as bad as it could've been.

I was around for much of the improvements to this process. Awhile back I wrote a series of blog posts detailing some of them. For those interested, you can find them here:

I haven't gotten around to writing a new one for the most recent version of the release automation, but if you compare our current Checklist to the old Unified Release Process, I'm sure you can get a sense of how much more efficient it is. Basically, we have push-button releases now. Fill in some basic info, push a button, and a release pops out: