Mozilla Software Release GPG Key Transition

Late last week we discovered the expiration of the GPG key that we use to sign Firefox, Fennec, and Thunderbird nightly builds and releases. We had been aware that this was coming up, but we unfortunately missed our deadline to renew it. This caused failures in many of our automated nightly builds, so it was quickly noticed and acted upon.

Our new GPG key is as follows, and available on keyservers such as gpg.mozilla.org and pgp.mit.edu:

pub   4096R/0x61B7B526D98F0353 2015-07-17
      Key fingerprint = 14F2 6682 D091 6CDD 81E3  7B6D 61B7 B526 D98F 0353
uid                            Mozilla Software Releases 
sub   4096R/0x1C69C4E55E9905DB 2015-07-17 [expires: 2017-07-16]

The new primary key is signed by many Mozillians, the old master key, as well as our OpSec team's GPG key. Nightlies and releases will now be signed with the subkey (0x1C69C4E55E9905DB), and a new one will be generated from the same primary key before this one expires. This means that you can validate Firefox releases with the primary public key in perpetuity.
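
If you pull the key from a keyserver, it's worth confirming that what you imported really is the key listed above. As a minimal sketch (the helper names are illustrative and not part of any Mozilla tool), the long key ID of a V4 OpenPGP key is the last 16 hex digits of its fingerprint, so the two values in the listing can be cross-checked like this:

```python
# Values copied from the key listing above.
PRIMARY_FINGERPRINT = "14F2 6682 D091 6CDD 81E3  7B6D 61B7 B526 D98F 0353"
PRIMARY_KEY_ID = "0x61B7B526D98F0353"


def normalize_hex(value):
    """Strip whitespace and any 0x prefix, and upper-case the hex digits."""
    compact = "".join(value.split()).upper()
    return compact[2:] if compact.startswith("0X") else compact


def key_id_matches_fingerprint(key_id, fingerprint):
    """Return True if a long (64-bit) key ID is the tail of a V4 fingerprint."""
    key_id = normalize_hex(key_id)
    fingerprint = normalize_hex(fingerprint)
    # A V4 fingerprint is 40 hex digits; the long key ID is its last 16.
    return (
        len(fingerprint) == 40
        and len(key_id) == 16
        and fingerprint.endswith(key_id)
    )
```

The signing subkey's ID (0x1C69C4E55E9905DB) will not match the primary fingerprint, which is expected: it has a fingerprint of its own.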

We are investigating a few options to make sure key renewal happens without delay in the future.

Mozilla will stop producing automated builds of XULRunner after the 41.0 cycle

XULRunner is a runtime package that can be used to run XUL+XPCOM based applications. Automated builds of it have been produced alongside Firefox since 2006, but it has not been a supported or resourced product for many years. We've continued to produce automated builds of it because its build process also happens to build the Gecko SDK, which we do support and maintain. This will change soon, and we'll start building the Gecko SDK from Firefox instead (bug 672509). This work will land on mozilla-central during the 42.0 cycle, which means that when the 41.0 cycle ends (September 22, 2015), automated builds of XULRunner will cease.

If you are a consumer of the Gecko SDK this means very little to you -- we will continue to produce it with every Firefox release.

If you are a consumer of the XULRunner stub this means that you will no longer have a Mozilla produced version after 41.0. For folks in this group, you have two options:

  • Change your app to run through the stub provided by Firefox. Many apps will continue to work as before by simply replacing "xulrunner.exe application" with "firefox -app application.ini".
  • Build XULRunner yourself.
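
For the first option, a launcher script only needs to change the command it runs. A minimal sketch, assuming Firefox is on the PATH as "firefox" (the function names here are made up for illustration):

```python
import subprocess


def build_launch_command(application_ini, firefox="firefox"):
    """Build the argv for running a XUL app through the Firefox stub."""
    return [firefox, "-app", application_ini]


def launch(application_ini, firefox="firefox"):
    """Start the app and return the Popen handle (needs Firefox installed)."""
    return subprocess.Popen(build_launch_command(application_ini, firefox))
```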

Buildbot <-> Taskcluster Bridge Now in Production

A few weeks ago I gave a brief overview of the Buildbot <-> Taskcluster Bridge that we've been developing, and Selena provided some additional details about it yesterday. Today I'm happy to announce that it is ready to take on production work. As more and more jobs from our CI infrastructure move to Taskcluster, the Bridge will coordinate between them and jobs that must remain in Buildbot for the time being.

What's next?

The Bridge itself is feature complete until our requirements change (though there are a couple of minor bugs that would be nice to fix), but most of the Buildbot Schedulers still need to be replaced with Task Graphs. Some of this work will be done at the same time as porting specific build or test jobs to run natively in Taskcluster, but it doesn't have to be. I made a proof of concept on how to integrate selected Buildbot builds into the existing "taskcluster-graph" command and disable the Buildbot schedulers that it replaces. With a bit more work this could be extended to schedule all of the Buildbot builds for a branch, which would make porting specific jobs simpler. If you'd like to help out with this, let me know!

Buildbot Taskcluster Bridge - An Overview

Mozilla has been using Buildbot as its continuous integration system for Firefox and Fennec for many years now. It enabled us to switch from a machine-per-build model to a pool-of-slaves model, and greatly aided us in getting to our current scale. But it's not perfect - and we've known for a few years that we'll need to do an overhaul. Lucky for us, the FirefoxOS Automation team has built up a fantastic piece of infrastructure known as Taskcluster that we're eager to start moving to.

It's not going to be a small task though - it will take a lot more work than taking our existing build scripts and running them in Taskcluster. One reason for this is that many of our jobs trigger other jobs, and Buildbot manages those relationships. This means that if we have a build job that triggers a test job, we can't move one without moving the other. We don't want to be forced into moving entire job chains at once, so we need something to help us transition more slowly. Our solution to this is to make it possible to schedule jobs in Taskcluster while still implementing them in Buildbot. Once the scheduling is in Taskcluster it's possible to move individual jobs to Taskcluster one at a time. The software that makes this possible is the Buildbot Bridge.

The Bridge is responsible for synchronizing job state between Taskcluster and Buildbot. Jobs that are requested through Taskcluster will be created in Buildbot by the Bridge. When those jobs complete, the Bridge will update Taskcluster with their status. Let's look at a simple example to see how the state changes in both systems over the course of a job being submitted and run:

Event                                                                  | Taskcluster state | Buildbot state
-----------------------------------------------------------------------+-------------------+-------------------
Task is created                                                        | Task is pending   | --
Bridge receives "task-pending" event, creates BuildRequest             | Task is pending   | Build is pending
Build starts in Buildbot                                               | Task is pending   | Build is running
Bridge receives "build started" event, claims the Task                 | Task is running   | Build is running
Build completes successfully                                           | Task is running   | Build is completed
Bridge receives "build finished" event, reports success to Taskcluster | Task is resolved  | Build is completed

The details of how this works are a bit more complicated - if you'd like to learn more about that I recommend watching the presentation I did about the Bridge architecture, or just have a read through my slides.
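
The happy path in the example above can also be written down as a tiny state table. This is only a sketch of the sequence described here, not the Bridge's real code; the event names are labels I made up for illustration:

```python
# Each event maps to the (Taskcluster, Buildbot) state it leaves behind.
# None means "leave that side unchanged". Event names are hypothetical.
TRANSITIONS = {
    "task_created": ("pending", None),
    "bridge_creates_build_request": (None, "pending"),
    "build_started": (None, "running"),
    "bridge_claims_task": ("running", None),
    "build_completed": (None, "completed"),
    "bridge_reports_success": ("resolved", None),
}


def apply_event(state, event):
    """Return the new (taskcluster, buildbot) pair after one event."""
    taskcluster, buildbot = state
    new_tc, new_bb = TRANSITIONS[event]
    return (new_tc or taskcluster, new_bb or buildbot)


def replay(events):
    """Replay events from an empty state and return every intermediate state."""
    state = (None, None)
    history = []
    for event in events:
        state = apply_event(state, event)
        history.append(state)
    return history
```

Replaying the six events in order walks through the same rows as the example: the Task only moves to running once the Bridge has seen the build start, and only resolves once the Bridge has seen it finish.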

A great tech support experience, from the most unlikely provider

After ranting and raving about them so much when trying to cancel my TV service, I feel like I should also share the really great experience I had with Bell recently. We've been having intermittent issues with our Internet connection. It wasn't clear whether or not it was our router or an issue with the DSL modem or line. I was dreading calling Bell's tech support because of so many bad experiences with tech support from all companies in the past.

To my great surprise, the call was one of the best tech support experiences I've ever had. The level 1 tech was the normal "turn it off, turn it on", but the level 2 seemed genuinely motivated to help me. She did not freak out and blame me when I told her I had my own router hooked up (something that Bell discourages). It was almost a brainstorming session for a few minutes of where the issue may be. I explained that the connection seems to drop when doing high bandwidth things like Skype or other videoconferencing. She thought for a second and then wondered if it only happened on wifi. That was a great insight, because Netflix (via our media centre PC, hooked up via ethernet) caused no issues at all. It was pretty clear to me that the problem was with my own router at that point, but she still insisted on running a line test to make sure there was nothing there. After that showed no issues I told her I was pretty sure the issue was with my own router, so I'd replace that. Even at that point she was insisting that I call back if I continue to experience issues, to the point that her supervisor called me back a few hours later to follow-up.

I don't know if this was a one-off experience or if something changed recently, but that was amazing tech support - even better than I received from geek-oriented ISPs like TekSavvy. Thank you Bell, keep it up.

Release Automation Futures: Seamless integration of manual and automated steps

I've written about the history of our Release Automation systems in the past. We've gone from mostly manual releases to almost completely automated since I joined Mozilla. One thing I haven't talked about before is Ship It - our web tool for kicking off releases:



It may be ugly, but having it has meant that we don't have to log on to a single machine to ship a release. A release engineer doesn't even need to be around to start the release process - Release Management has direct access to Ship It to do it themselves. We're only needed to push releases live, and that's something we'd like to fix as well. We're looking at tackling that and other ancillary issues of releases, such as:

  • Improving and expanding validation of release automation inputs (revisions, branches, locales, etc.)
  • Scripting the publishing of Fennec to Google Play
  • Giving Release Managers more direct control over updates
  • Updating metadata (ship dates, versions, locales) about releases
  • Improving security with better authentication (eg, HSMs or other secondary tokens) and authorization (eg, requiring multiple people to push updates)

Rail and I had a brainstorming session about this yesterday and a theme that kept coming up was that most of the things we want to improve are on the edges of release automation: they happen either before the current automation starts, or after the current automation ends. Everything in this list also needs someone to decide that it needs to happen -- our automation can't make the decision about what revision a release should be built with or when to push it to Google Play - it only knows how to do those things after being told that it should. These points where we jump back and forth between humans and automation are a big rough edge for us right now. The way they're implemented currently is very situation-specific, which means that adding new points of human-automation interaction is slow and full of uncertainty. This is something we need to fix in order to continue to ship as fast and effectively as we do.

We think we've come up with a new design that will enable us to deal with all of the current human-automation interactions and any that come up in the future. It consists of three key components:

Workflows

A workflow is a DAG that represents an entire release process. It consists of human steps, automation steps, and potentially other types. An important point about workflows is that they aren't necessarily the same for every release. A Firefox Beta's workflow is different than a Fennec Beta or Firefox Release. The workflow for a Firefox Beta today may look very different than for one a few months from now. The details of a workflow are explicitly not baked into the system - they are part of the data that feeds it. Each node in the DAG will have upstreams, downstreams, and perhaps a list of notifications. The tooling around the workflow will respond to changes in state of each node and determine what can happen next. Much of each workflow will end up being the existing graph of Buildbot builders (eg: this graph of Firefox Beta jobs).
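
To make the idea a bit more concrete, here is a minimal sketch of a workflow as data plus a small amount of tooling that works out what can happen next. Nothing here is an existing system; the class names, node names, and node kinds are all hypothetical:

```python
class Node:
    """One step in a release workflow (fields are illustrative)."""

    def __init__(self, name, kind, upstreams=()):
        self.name = name
        self.kind = kind                  # e.g. "human" or "automation"
        self.upstreams = list(upstreams)  # names of nodes that must finish first
        self.state = "pending"            # pending -> completed


class Workflow:
    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}

    def downstreams(self, name):
        """Names of nodes that list `name` as an upstream."""
        return sorted(n.name for n in self.nodes.values() if name in n.upstreams)

    def ready(self):
        """Pending nodes whose upstreams have all completed."""
        return sorted(
            n.name
            for n in self.nodes.values()
            if n.state == "pending"
            and all(self.nodes[u].state == "completed" for u in n.upstreams)
        )

    def complete(self, name):
        """Mark a node completed and return what can happen next."""
        self.nodes[name].state = "completed"
        return self.ready()


def example_workflow():
    """A made-up four-step release mixing human and automation steps."""
    return Workflow([
        Node("choose_revision", "human"),
        Node("build", "automation", upstreams=["choose_revision"]),
        Node("repacks", "automation", upstreams=["build"]),
        Node("push_live", "human", upstreams=["build", "repacks"]),
    ])
```

Because the graph is just data, a different release type would mean a different list of nodes rather than different code.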

We're hoping to use existing software for this part. We've looked at Amazon's Simple Workflow Service already, but it doesn't support any dependencies between nodes, so we're not sure if it's going to fit the bill. We're also looking at Taskcluster, which does do dependency management. If anyone knows of anything else that might be useful here please let me know!

Ship It

As well as continuing to provide a human interface, Ship It will be the API between the workflow tool and humans/automation. When new nodes become ready it makes that information available to automation, or gives humans the option to enact them (depending on node type). It also receives state changes of nodes from automation (eg, build completion events). Ship It may also be given the responsibility of enforcing user ACLs.

Release Runner

Release Runner is the binding between Ship It and the backend parts of the automation. When Ship It is showing automation events ready to start, it will poke the right systems to make them go. When those jobs complete, it will send that information back to Ship It.

This will likely be getting a better name.


This design still needs some more thought and review, but we're very excited to be moving towards a world where humans and machines can integrate more seamlessly to get you the latest Firefox hotness more quickly and securely.

Signing Software at Scale

Mozilla produces a lot of builds. We build Firefox for somewhere between 5 and 10 platforms (depending on how you count). We release Nightly and Aurora every single day, Beta twice a week, and Release and ESR every 6 weeks (at least). Each release contains an en-US build and nearly a hundred localized repacks. In the past the only builds we signed were Betas (which were once a week at the time), Releases, and ESRs. We had a pretty well established manual process for it, but because it was manual it was still error prone and impractical to use for Nightly and Aurora. Signing of Nightly and Aurora became an important issue when background updates were implemented, because one of the new security requirements with background updates was signed installers and MARs.

Enter: Signing Server

At this point it was clear that the only practical way to sign all the builds that we need to is to automate it. It sounded crazy to me at first. How can you automate something that depends on secret keys, passphrases, and very unfriendly tools? Well, there are some tricks you need to know, and throughout the development and improvement of our "signing server", we've learned a lot. In this post I'll talk about those tricks and show you how you can use them (or even our entire signing server!) to make your signing process faster and easier.

Credit where credit is due: Chris AtLee wrote the core of the signing server and support for some of the signature types. Over time Erick Dransch, Justin Wood, Dustin Mitchell, and I have made some improvements and added support for additional types of signatures.


Tip #1: Collect passphrases at startup

This should be obvious to most, but it's very important not to store the passphrases to your private keys unencrypted. However, because they're needed to unlock the private keys when doing any signing, the server needs to have access to them somehow. We've dealt with this by asking for them when launching a signing server instance:

$ bin/python tools/release/signing/signing-server.py signing.ini
gpg passphrase: 
signcode passphrase: 
mar passphrase: 

Because instances are started manually by someone in the small set of people with access to passphrases, we're able to ensure that keys are never left unencrypted at rest.
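
If you want to do something similar in your own tooling, Python's getpass module does most of the work. This is a minimal sketch rather than the signing server's actual startup code; collect_passphrases and its prompt parameter are names I've made up:

```python
import getpass


def collect_passphrases(formats, prompt=getpass.getpass):
    """Ask once per signing format and keep the answers in memory only."""
    passphrases = {}
    for fmt in formats:
        # getpass does not echo what is typed, and nothing is written to disk.
        passphrases[fmt] = prompt("%s passphrase: " % fmt)
    return passphrases
```

Calling collect_passphrases(["gpg", "signcode", "mar"]) at startup would produce prompts like the ones shown above. The prompt argument exists only so the function can be exercised without a terminal.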

Tip #2: Don't let just any machine request signed files

One of the first problems you run into when you have an API for signing files is how to make sure you don't accidentally sign malicious files. We've dealt with this in a few ways:

  • You need a special token in order to request any type of signing. These tokens are time limited and only a small subset of segregated machines may request them (on behalf of the build machines). Since build jobs can only be created if you're able to push to hg.mozilla.org, random people are unable to submit anything for signing.
  • Only our build machines are allowed to make signing requests. Even if you managed to get hold of a valid signing token, you wouldn't be able to do anything with it without also having access to a build machine. This is a layer of security that helps us protect against a situation where an evil doer may gain access to a loaner machine or other less restricted part of our infrastructure.

We have other layers of security built in too (HTTPS, firewalls, access control, etc.), but these are the key ones built into the signing server itself.
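
The two checks above (a time-limited token, usable only from an approved machine) can be illustrated with an HMAC over the client address and validity window. To be clear, this is a sketch of the general technique under my own assumptions, not the token format our signing server uses; the function names and the "ip:from:to:signature" layout are invented for the example:

```python
import hashlib
import hmac
import time


def make_token(secret, client_ip, valid_from, valid_to):
    """Return 'ip:from:to:signature' for one client and one time window."""
    payload = "%s:%d:%d" % (client_ip, valid_from, valid_to)
    signature = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return "%s:%s" % (payload, signature)


def verify_token(secret, token, client_ip, now=None):
    """Accept a token only if it is authentic, unexpired, and from the right IP."""
    now = time.time() if now is None else now
    try:
        token_ip, valid_from, valid_to, signature = token.rsplit(":", 3)
        valid_from, valid_to = int(valid_from), int(valid_to)
    except ValueError:
        return False  # wrong number of fields or non-numeric timestamps
    expected = make_token(secret, token_ip, valid_from, valid_to).rsplit(":", 1)[1]
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return token_ip == client_ip and valid_from <= now <= valid_to
```

A stolen token fails verification when presented from any other address, and a valid one stops working once its window closes.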

Tip #3: Use input redirection and other tricks to work around unfriendly command line tools

One of the trickiest parts about automating signing is getting all the necessary command line tools to accept input that's not coming from a console. Some of them are relatively easy and accept passphrases via stdin:

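from subprocess import PIPE, STDOUT, Popen

# Feed the passphrase to the tool on stdin instead of typing it at a console.
proc = Popen(command, stdout=stdout, stderr=STDOUT, stdin=PIPE)
proc.stdin.write(passphrase)
proc.stdin.close()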

Others, like OpenSSL, are fussier and require the use of pexpect:

proc = pexpect.spawn("openssl", args)
proc.logfile_read = stdout
proc.expect('Enter pass phrase')
proc.sendline(passphrase)

And it's no surprise at all that OS X is the fussiest of them all. In order to sign you have to unlock the keychain by hand, run the signing command, and relock the keychain yourself:

child = pexpect.spawn("security unlock-keychain " + keychain)
child.expect('password to unlock .*')
child.sendline(passphrase)
check_call(sign_command + [f], cwd=dir_, stdout=stdout, stderr=STDOUT)
check_call(["security", "lock-keychain", keychain])
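
One thing worth guarding against in the OS X sequence above is a failed signing command leaving the keychain unlocked. A minimal sketch, assuming you wrap the unlock and lock steps in small functions of your own (unlocked_keychain and sign_all are hypothetical helpers, not something from our tools repository):

```python
from contextlib import contextmanager


@contextmanager
def unlocked_keychain(unlock, lock):
    """Run unlock() on entry and lock() on exit, even if the body raises."""
    unlock()
    try:
        yield
    finally:
        lock()


def sign_all(files, unlock, sign, lock):
    """Sign each file while the keychain is unlocked; relock afterwards."""
    with unlocked_keychain(unlock, lock):
        for f in files:
            sign(f)
```

Here unlock would do the pexpect dance shown above, sign would run the signing command, and lock would call "security lock-keychain".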

Although the code is simple in the end, a lot of trial, error, and frustration was necessary to arrive at it.

Tip #4: Sign everything you can on Linux (including Windows binaries!)

As fussy as automating tools like openssl can be on Linux, it pales in comparison to trying to automate anything on Windows. In the days before the signing server we had a scripted signing method that ran on Windows. Instead of providing the passphrase directly to the signing tool, it had to be typed into a modal window. It was "automated" with an AutoIt script that typed in the password whenever the window popped up. This was hacky, and sometimes led to issues if someone moved the mouse or pressed a key at the wrong time and changed window focus.

Thankfully there are tools available for Linux that are capable of signing Windows binaries. We started off by using Mono's signcode - a more or less drop in replacement for Microsoft's:

$ signcode -spc MozAuthenticode.spc -v MozAuthenticode.pvk -t http://timestamp.verisign.com/scripts/timestamp.dll -i http://www.mozilla.com -a sha1 -tr 5 -tw 60 /tmp/test.exe
Mono SignCode - version 2.4.3.1
Sign assemblies and PE files using Authenticode(tm).
Copyright 2002, 2003 Motus Technologies. Copyright 2004-2008 Novell. BSD licensed.

Enter password for MozAuthenticode.pvk: 
Success

This works great for 32-bit binaries - we've been shipping binaries signed with it for years. For some reason that we haven't figured out though, it doesn't sign 64-bit binaries properly. For those we're using "osslsigncode", which is an OpenSSL based tool to do Authenticode signing:

$ osslsigncode -certs MozAuthenticode.spc -key MozAuthenticode.pvk -i http://www.mozilla.com -h sha1 -in /tmp/test64.exe -out /tmp/test64-signed.exe
Enter PEM pass phrase:
Succeeded

$ osslsigncode verify /tmp/test64-signed.exe 
Signature verification: ok

Number of signers: 1
    Signer #0:
        Subject: /C=US/ST=CA/L=Mountain View/O=Mozilla Corporation/CN=Mozilla Corporation
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1

Number of certificates: 3
    Cert #0:
        Subject: /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
    Cert #1:
        Subject: /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Root CA
    Cert #2:
        Subject: /C=US/ST=CA/L=Mountain View/O=Mozilla Corporation/CN=Mozilla Corporation
        Issuer : /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert Assured ID Code Signing CA-1

In addition to Authenticode signing we also do GPG, APK, and a couple of Mozilla-specific types of signing (MAR, EME Voucher) on Linux. We also sign our Mac builds with the signing server. Unfortunately, the tools needed for that are only available on OS X, so we have to run separate signing servers for these.
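
If you end up with the same split between signcode for 32-bit binaries and osslsigncode for 64-bit ones, the choice can be made by reading the Machine field from the PE header. The sketch below is my own illustration, not code from the signing server; the function names are hypothetical, and the command lines simply reuse the flags from the transcripts above. Both tools will still prompt for a passphrase.

```python
import struct

IMAGE_FILE_MACHINE_I386 = 0x014C
IMAGE_FILE_MACHINE_AMD64 = 0x8664


def pe_machine(path):
    """Read the COFF Machine field from a PE file's header."""
    with open(path, "rb") as f:
        header = f.read(64)
        if len(header) < 64 or header[:2] != b"MZ":
            raise ValueError("not a PE file: %s" % path)
        # e_lfanew at offset 0x3C points at the "PE\0\0" signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        f.seek(pe_offset)
        signature = f.read(6)
    if len(signature) < 6 or signature[:4] != b"PE\x00\x00":
        raise ValueError("missing PE signature: %s" % path)
    (machine,) = struct.unpack_from("<H", signature, 4)
    return machine


def signing_command(path, signed_path, spc, pvk,
                    url="http://www.mozilla.com",
                    timestamp_url="http://timestamp.verisign.com/scripts/timestamp.dll"):
    """Pick osslsigncode for 64-bit binaries and Mono's signcode otherwise.

    signcode signs in place, so signed_path is only used for osslsigncode.
    """
    if pe_machine(path) == IMAGE_FILE_MACHINE_AMD64:
        return ["osslsigncode", "-certs", spc, "-key", pvk, "-i", url,
                "-h", "sha1", "-in", path, "-out", signed_path]
    return ["signcode", "-spc", spc, "-v", pvk, "-t", timestamp_url,
            "-i", url, "-a", "sha1", "-tr", "5", "-tw", "60", path]
```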

Tip #5: Run multiple signing servers

Nobody likes a single point of failure, so we've built support into our signing client for retrying against multiple instances. Even if we lose part of our signing server pool, our infrastructure stays up:
$ python signtool.py --cachedir cache -t token -n nonce -c host.cert -H dmgv2:mac-v2-signing1.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing2.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing3.srv.releng.scl3.mozilla.com:9120 -H dmgv2:mac-v2-signing4.srv.releng.scl3.mozilla.com:9120 --formats dmgv2 Firefox.app
2015-01-23 06:17:59,112 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing3.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:17:59,118 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: connection error; trying again soon
2015-01-23 06:18:00,119 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:18:00,141 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: uploading for signing
2015-01-23 06:18:10,748 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:19:11,848 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: processing Firefox.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9120
2015-01-23 06:19:40,480 - ed40176524e7c197f4e23f6065a64dc3c9a62e71: OK
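
The failover behaviour in that log boils down to a loop over the configured hosts. Here is a minimal sketch of the idea, not signtool.py's real implementation; sign_with_failover, the shuffle, and the max_rounds parameter are my own choices for the example:

```python
import random


def sign_with_failover(hosts, sign, max_rounds=3):
    """Try each host in turn until one signs; raise if every attempt fails."""
    hosts = list(hosts)
    random.shuffle(hosts)  # assumption: spread load across the pool
    last_error = None
    for _ in range(max_rounds):
        for host in hosts:
            try:
                return sign(host)
            except IOError as error:
                # Connection problem: remember it and move on to the next host.
                last_error = error
    raise RuntimeError("all signing servers failed: %s" % last_error)
```

The sign callable stands in for whatever uploads the file and waits for the signed result from a single host.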

Running your own signing server

It's easy! All of the code you need to run your own signing server is in our tools repository. You'll need to set up a virtualenv and create your own config file, but once you're ready you can attempt to start it with the following command:

python signing-server.py signing.ini

You'll be prompted for the passphrases to your private keys. If there are any problems with your config file or the passphrases the server will fail to start. Once you've got it up and running you can try signing! get_token.py has an example of how to generate a signing token, and signtool.py will take your unsigned files and give you back signed versions. Happy signing!

UPDATED: New update server is going live for release channel users on Tuesday, January 20th

(This post has been updated with the new go-live date.)

Our new update server software (codenamed Balrog) has been in development for quite a while now. In October of 2013 we moved Nightly and Aurora to it. This past September we moved Beta users to it. Finally, we're ready to switch the vast majority of our users over. We'll be doing that on the morning of Tuesday, January 20th. Just like when we switched nightly/aurora/beta over, this change should be invisible, but please file a bug or swing by #releng if you notice any issues with updates.

Stick around if you're interested in some of the load testing we did.


Shortly after switching all of the Beta users to Balrog we did a load test to see if Balrog could handle the amount of traffic that the release channel would throw at it. With just 10% of the release traffic being handled, it blew up:

We were pulling more than 150MBit/sec per web head from the database server, and saturating the CPUs completely. This caused very slow requests, to the point where many were just timing out. While we were hoping that it would just work, this wasn't a complete surprise given that we hadn't implemented any form of caching yet. After implementing a simple LRU cache on Balrog's largest objects, we did another load test. Here's what the load looked like on one web head:

Once caching was enabled the load was practically non-existent. As we ramped up release channel traffic the load grew, but in a more or less linear (and very gradual) fashion. At around 11:35 on this graph we were serving all of the release channel traffic, and each web head was using a meager 50% of its CPU:

I'm not sure what to call that other than winning.
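
For the curious, the caching idea itself is simple. This is not Balrog's code, just a minimal sketch of an LRU cache in front of an expensive lookup; the class, its get method, and the load callable are names I've picked for the example:

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `max_size` entries, evicting the least recently used."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        """Return the cached value, calling load(key) only on a miss."""
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)     # mark as most recently used
            return self.data[key]
        self.misses += 1
        value = self.data[key] = load(key)  # e.g. a database query
        if len(self.data) > self.max_size:
            self.data.popitem(last=False)  # drop the least recently used
        return value
```

When a handful of large objects account for most requests, nearly every lookup becomes a hit and the database is only consulted on the rare miss.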

Redo 1.3 is released - now with more natural syntax!

We've been using the functions packaged in Redo for a few years now at Mozilla. One of the things we've been striving for with it is the ability to write the most natural code possible. In its simplest form, retry, everything is passed in as arguments: a callable that may raise, the exceptions to retry on, and a callable to run for cleanup before another attempt. As a result, we have a number of code blocks like this, which don't feel very Pythonic:

retry(self.session.request, sleeptime=5, max_sleeptime=15,
  retry_exceptions=(requests.HTTPError,
                    requests.ConnectionError),
  attempts=self.retries,
  kwargs=dict(method=method, url=url, data=data,
              config=self.config, timeout=self.timeout,
              auth=self.auth, params=params)
)

It's particularly unfortunate that you're forced to let retry do your exception handling and cleanup - I find that it makes the code a lot less readable. It's also not possible to do anything in a finally block, unless you wrap the retry in one.

Recently, Chris AtLee discovered a new method of doing retries that results in much cleaner and more readable code. With it, the above block can be rewritten as:


for attempt in retrier(attempts=self.retries):
    try:
        self.session.request(method=method, url=url, data=data,
                             config=self.config,
                             timeout=self.timeout, auth=self.auth,
                             params=params)
        break
    except (requests.HTTPError, requests.ConnectionError):
        pass

retrier simply handles the mechanics of tracking attempts and sleeping, leaving your code to do all of its own exception handling and cleanup - just as if you weren't retrying at all. Note that the break at the end of the try block is essential: without it, self.session.request would run again even after it succeeded.
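
If you're wondering how a generator can drive a retry loop, here is a stripped-down sketch of the idea. It is not Redo's implementation (the real retrier has more options, such as jitter); simple_retrier and fetch_with_retries are hypothetical names, and what the generator yields is my own choice:

```python
import time


def simple_retrier(attempts=5, sleeptime=10, max_sleeptime=300, sleepscale=1.5):
    """Yield once per attempt, sleeping between attempts with growing delays."""
    delay = sleeptime
    for attempt in range(1, attempts + 1):
        yield attempt
        # We only get here if the caller asked for another attempt.
        if attempt < attempts:
            time.sleep(delay)
            delay = min(delay * sleepscale, max_sleeptime)


def fetch_with_retries(fetch, attempts=3, sleeptime=0):
    """Call fetch() until it succeeds or the attempts run out."""
    for _ in simple_retrier(attempts=attempts, sleeptime=sleeptime):
        try:
            return fetch()
        except IOError:
            pass  # handle or log here, then loop around for another try
    raise RuntimeError("gave up after %d attempts" % attempts)
```

Returning (or breaking) out of the loop stops the generator before it sleeps again, which is why a successful call costs no extra delay.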

I released Redo 1.3 with this new functionality this morning - enjoy!

Stop stripping (OS X builds), it leaves you vulnerable

While investigating some strange update requests on our new update server, I discovered that we have thousands of update requests from Beta users on OS X that aren't getting an update, but should. After some digging I realized that most, if not all, of these are coming from users who have installed one of our official Beta builds and subsequently stripped out the architecture they do not need from it. In turn, this causes our builds to report in such a way that we don't know how to serve updates for them.

We'll look at ways of addressing this, but the bottom line is that if you want to be secure: Stop stripping Firefox binaries!