A Flurry of Balrog Activity

This past quarter I spent some time modernizing Balrog's toolchain to make it more approachable. We've switched from Vagrant to Docker, cleaned up setup.py, started using tox, and updated the sample data included in the repo. At the same time, I started identifying some good first bugs, and put together a proposal for a Summer of Code project.

I feel very lucky that this work has paid off so quickly. There's been great interest in the Summer of Code project, and we've had 5 new volunteers submit patches to Balrog. These people are doing really great work, and I'd like to highlight their contributions today (in no particular order).

Njira Perci

Njira has focused on UI improvements, and has already improved the confirmation dialog for deleting Rules and added the ability to autocomplete Products and Channels in form fields. She continues to hack away and is now working on improving the Releases UI to highlight whether or not a release is active in any way.

Ashish Sareen

Ashish fixed a bug where the Admin server would hit an ISE 500 under certain conditions. With his patch, it now correctly returns a 400 error to the client.

Varun Joshi

Varun has been diving deep into the Admin server. He started off by fixing a small bug where some weirdly formed Releases caused ISE 500s, and has since provided a patch that gives us the ability to mark Releases as "read only". This is something we intend to make use of in our new Release Promotion system to guard against accidental (or malicious...) changes to Release metadata.

Aybüke Özdemir

Aybüke enhanced the UI to show rule_ids, which makes it easier for humans to find them when they need to put them into a script or automation.

Kumar Rishabh

Kumar fixed a very annoying bug where diffs of different versions of Releases would be generated against the wrong base version, making them essentially useless.


If you would like to get involved in the development of Balrog, we'd love to have you. The wiki page can get you bootstrapped, and you can find us all on irc.mozilla.org in #balrog.

Streamlining throttled rollout of Firefox releases

When we ship new versions of Firefox we do our best to avoid introducing new bugs or crashes, especially those that affect large numbers of users. One of the strategies we use to accomplish this is to ship new versions to a subset of people before shipping to everyone. We call this a "throttled rollout", and it's something we've been doing for many years. The tricky part of this is getting the new version to enough users to have a representative sample size without overshooting our target.

Our current process for doing this is as follows:

  • Enable updates to the new version at a rate of 25% (meaning 25% of update requests will be offered the new version)
  • Wait ~24 hours
  • Turn the update rate down to 0%
  • Hope that we hit our target without going over
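To make the 25% figure concrete, a rate like this can be pictured as a weighted coin flip on every update request. The sketch below is only an illustration of that idea, not Balrog's actual code; `should_offer_update` and its parameters are hypothetical names.

```python
import random


def should_offer_update(rate_percent, rng=random):
    """Return True if this update request should be offered the new version.

    rate_percent is the throttle rate (0-100). Hypothetical helper, not
    Balrog's real implementation.
    """
    # rng.random() is uniform in [0.0, 1.0), so a rate of 0 never serves
    # the update and a rate of 100 always does.
    return rng.random() * 100 < rate_percent


# Simulate a day of update requests at a 25% rate.
rng = random.Random(0)  # seeded so the run is repeatable
requests = 100_000
offered = sum(should_offer_update(25, rng) for _ in range(requests))
print(f"{offered} of {requests} requests were offered the update")
```

Because each request is decided independently, the number of users who end up with the new version depends on how many requests arrive during the window, which is exactly the part that is hard to predict.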

The rate and time period have been tuned over time, but it's still a very fragile process. Sometimes we get more or fewer update requests than expected during the 24h window. Sometimes we forget to set the rate back down to 0%. A process that's driven manually and dependent on guesswork has a lot of things that can go wrong. We can do better here. What if we could schedule rate changes to avoid forgetting to make them? What if we could monitor real-time uptake information to eliminate the guesswork? Nick and I have come up with a plan that allows Balrog to do these things, and I'm excited to share it.

Enter: Balrog Agent

The Balrog Agent will be a new component of the system that can be configured to enact changes to update rules at specific times or when certain conditions are met. We will allow users to schedule changes through the Admin UI; the Agent will watch for their condition(s) to be hit, and then enact the requested change. For now the only condition we will support is uptake of a specific version on a specific channel, which we will soon be able to get from Telemetry. This diagram shows where the Agent fits into the system:

Once implemented, our new process could look something like:

  • Add scheduled change to enable updates to the new version at a rate of 25%
  • Add scheduled change to turn the update rate down to 0% after we hit our target uptake
  • Let the Balrog Agent do the rest

Unlike the manual changes in our current process, the creation of both scheduled changes is not time sensitive - it can be done at any point prior to release day. This means that humans don't have to be around and/or remember to flip bits at certain times, nor do we have to worry about tweaking the time windows as our uptake rate changes. It Just Works (tm).
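As a rough sketch of what the Agent's decision loop might look like, assuming a scheduled change carries either a time or an uptake target: the `ScheduledChange` fields and the `is_ready` and `run_once` helpers below are all hypothetical, and the real design may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduledChange:
    """One pending rule change. All field names are illustrative."""
    rule_id: int
    new_rate: int                        # update rate to set, in percent
    when: Optional[float] = None         # enact at/after this Unix time
    uptake_target: Optional[int] = None  # enact once this many users updated


def is_ready(change, now, uptake):
    """Decide whether the Agent should enact `change`.

    `now` is the current Unix time and `uptake` is the latest uptake
    count for the version and channel the change is watching.
    """
    if change.when is not None and now >= change.when:
        return True
    if change.uptake_target is not None and uptake >= change.uptake_target:
        return True
    return False


def run_once(changes, now, uptake, enact):
    """Enact every ready change and return the ones still pending."""
    pending = []
    for change in changes:
        if is_ready(change, now, uptake):
            enact(change)
        else:
            pending.append(change)
    return pending
```

With this shape, the two scheduled changes in the new process are just two entries in the list: one with `when` set to release time and a rate of 25, and one with `uptake_target` set and a rate of 0.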

As always, security was a concern when designing the Balrog Agent. We don't want it to have root-like access to the Balrog database; we just want it to be able to make the specific changes that users have already set up. To satisfy this requirement, we'll be adding a special endpoint to the Admin API (something like /rules/scheduled_changes) which can only enact changes that users have previously scheduled. When users schedule new changes through the UI, Balrog will ensure that they have permission to make the change they're scheduling. The Agent will use the new endpoint to enact changes, which prevents it from making changes that a user didn't explicitly request. As with other parts of Balrog's database, the history of scheduled changes and when they were enacted will be kept to ensure that they are auditable.
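The permission model described here can be sketched with a small in-memory stand-in. Everything in it (the class, method names, and permission format) is an assumption made for illustration; the point is only that permissions are checked when a change is scheduled, and that the Agent can refer to a change by ID but never supply its contents.

```python
class PermissionDenied(Exception):
    pass


class ScheduledChanges:
    """In-memory stand-in for a scheduled changes table (hypothetical)."""

    def __init__(self, permissions):
        # permissions maps a username to the set of products they may change.
        self.permissions = permissions
        self.pending = {}   # sc_id -> (product, change, scheduled_by)
        self.history = []   # audit trail of (event, sc_id, who)
        self.next_id = 1

    def schedule(self, user, product, change):
        """Called on behalf of a user: permissions are checked up front."""
        if product not in self.permissions.get(user, set()):
            raise PermissionDenied(f"{user} may not change {product}")
        sc_id = self.next_id
        self.next_id += 1
        self.pending[sc_id] = (product, change, user)
        self.history.append(("scheduled", sc_id, user))
        return sc_id

    def enact(self, sc_id, apply_change):
        """Called by the Agent: it passes only an ID, never the change itself."""
        product, change, _user = self.pending.pop(sc_id)  # KeyError if unknown
        apply_change(product, change)
        self.history.append(("enacted", sc_id, "agent"))
```

Since `enact` takes nothing but an ID, a compromised Agent in this model could at worst enact a change early; it could not invent a new one.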

Because this is the first time we'll have an automated system making changes to update rules at unpredictable times, another concern that came up was making sure that humans are not surprised when it happens. It's going to feel weird at first to have the release channel update rate managed by automation. To minimize the surprise and confusion of this we're planning to have the Agent send out e-mail before making changes. This serves as a heads up to us humans and gives us time to react if the Agent is about to make a change that may not be desired anymore.

We know from past experience that it's impossible for us to predict the interesting ways and conditions we'll want to offer updates. One of the things I really like about this design is that the only limit to what we can do is the data that the Agent has available. While it's starting off with uptake data, we can enhance it later to look at Socorro or other key systems. Wouldn't it be pretty cool if we automatically shut off updates if we hit a major crash spike? I think so.

If you're interested in the nitty-gritty details of this project there's a lot more information in the bug. If you're interested in Balrog in general, I encourage you to check out the wiki or come chat with us on IRC.

Collision Detection and History with SQLAlchemy

Balrog is one of the more crucial systems that Release Engineering works on. Many of our automated builds send data to it and all Firefox installations in the wild regularly query it to look for updates. It is an SQLAlchemy based app, but because of its huge importance it became clear in the early stages of development that we had a couple of requirements that went beyond those of most other SQLAlchemy apps, specifically:

  • Collision Detection: Changes to the database must always be done safely. Balrog must not allow one change to silently override another one.
  • Full History: Balrog must provide complete auditability and history for all changes. We must be able to associate every change with an account and a timestamp.

Implementing these two requirements ended up being an interesting project and I'd like to share the details of how it all works.

Collision Detection

Anyone who's used Bugzilla for a while has probably encountered this screen before:

This screenshot shows how Bugzilla detects and warns if you try to make a change to a bug before loading changes someone else has made. While it's annoying when this screen slows you down, it's important that Bugzilla doesn't let you unknowingly overwrite other folks' changes. This is very similar to what we wanted to do in Balrog, except that we needed to enforce it at the API level, not just in the UI. In fact, we decided it was best to enforce it at the lowest level possible to minimize the chance of needing to duplicate it in different parts of the app.

To do this, we started by creating a thin wrapper around SQLAlchemy which ensures that each table has a "data_version" column, and requires an "old_data_version" to be passed when doing an UPDATE or DELETE. Here's a slimmed down version of how it works with UPDATE:

from sqlalchemy import Column, Integer


class OutdatedDataError(Exception):
    """Raised when old_data_version doesn't match the row's data_version."""


class AUSTable(object):
    """Base class for Balrog tables. Subclasses must create self.table as an
    SQLAlchemy Table object prior to calling AUSTable.__init__()."""
    def __init__(self, engine):
        self.engine = engine
        # Ensure that the table has a data_version column on it.
        self.table.append_column(Column("data_version", Integer, nullable=False))

    def update(self, where, what, old_data_version):
        # Enforce the data_version check at the query level to eliminate
        # the possibility of a race condition between the time we can
        # retrieve the current data_version, and when we can update the row.
        # A new list is built so the caller's "where" isn't modified.
        where = where + [self.table.c.data_version == old_data_version]

        # engine.begin() yields a connection that commits on success and
        # rolls back if an exception is raised.
        with self.engine.begin() as transaction:
            # select() is a helper (not shown) that returns the row as a dict.
            row = self.select(where=where, transaction=transaction)
            row["data_version"] += 1
            for col in what:
                row[col] = what[col]

            # Write the bumped data_version along with the caller's changes.
            values = dict(what, data_version=row["data_version"])
            query = self.table.update().values(**values)
            for cond in where:
                query = query.where(cond)
            ret = transaction.execute(query)

            if ret.rowcount != 1:
                raise OutdatedDataError("Failed to update row, old_data_version doesn't match data_version")

And one of our concrete tables:

from sqlalchemy import Column, String, Table, Text


class Releases(AUSTable):
    def __init__(self, engine, metadata):
        self.table = Table("releases", metadata,
            Column("name", String(100), primary_key=True),
            Column("product", String(15), nullable=False),
            Column("data", Text(), nullable=False),
        )
        AUSTable.__init__(self, engine)

    def updateRelease(self, name, old_data_version, product, data):
        what = {
            "product": product,
            "data": data,
        }
        # Columns live under table.c; table.name is just the table's name.
        self.update(where=[self.table.c.name == name], what=what, old_data_version=old_data_version)

As you can see, the data_version check is added as a clause to the UPDATE statement - so there's no way we can race with other changes. The usual workflow for callers is to retrieve the current version of the data, modify it, and pass it back along with the old data_version (most of the time retrieval and pass back happens through the REST API). It's worth pointing out that a client could pass a different value as old_data_version in an attempt to thwart the system. This is something we explicitly do not try to protect against (and honestly, I don't think we could) -- data_version is a protection against accidental races, not against malicious changes.
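The same idea can be tried out end to end without any of Balrog's code. The following is a minimal, self-contained sketch using Python's built-in sqlite3 module rather than SQLAlchemy; the table layout and the `update_release` helper are made up for the demonstration.

```python
import sqlite3


class OutdatedDataError(Exception):
    pass


def update_release(conn, name, data, old_data_version):
    """Update a row only if nobody else has changed it since we read it."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE releases SET data = ?, data_version = data_version + 1 "
            "WHERE name = ? AND data_version = ?",
            (data, name, old_data_version),
        )
        if cur.rowcount != 1:
            raise OutdatedDataError("old_data_version doesn't match data_version")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE releases (name TEXT PRIMARY KEY, data TEXT NOT NULL, "
    "data_version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO releases VALUES ('Firefox-41.0', '{}', 1)")
conn.commit()

# Two clients both read data_version 1. The first write wins...
update_release(conn, "Firefox-41.0", '{"a": 1}', old_data_version=1)
# ...and the second is rejected instead of silently overwriting it.
try:
    update_release(conn, "Firefox-41.0", '{"b": 2}', old_data_version=1)
except OutdatedDataError as e:
    print("collision detected:", e)
```

The check and the write happen in a single UPDATE statement, so there is no window between "read the version" and "write the row" for another change to slip into.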

Full History

Having history of all changes to Balrog's database is not terribly important on a day-to-day basis, but when we have issues related to updates it's extremely important that we're able to look back in time and see why a particular issue happened, how long it existed for, and who made the change. Like collision detection, this is implemented at a low level of Balrog to make sure it's difficult to bypass when writing new code.

To achieve it we create a History table for each primary data table. For example, we have both "releases" and "releases_history" tables. In addition to all of the Releases columns, the associated History table also has columns for the account name that makes each change and a timestamp of when it was made. Whenever an INSERT, UPDATE, or DELETE is executed, the History table has a new row inserted with the full contents of the new version. These are done in a single transaction to make sure it is an all-or-nothing operation.

Building on the code from above, here's a simplified version of how we implement History:

import time

from sqlalchemy import BigInteger, Column, Integer, String, Table


class AUSTable(object):
    """Base class for Balrog tables. Subclasses must create self.table as an
    SQLAlchemy Table object prior to calling AUSTable.__init__()."""
    def __init__(self, engine, history=True, versioned=True):
        self.engine = engine
        self.versioned = versioned
        # Versioned tables (generally, non-History tables) need a data_version.
        if versioned:
            self.table.append_column(Column("data_version", Integer, nullable=False))

        # Well defined interface to the primary_key columns, needed by History tables.
        self.primary_key = []
        for col in self.table.get_children():
            if col.primary_key:
                self.primary_key.append(col)

        if history:
            self.history = History(self.table.metadata, self)
        else:
            self.history = None

    def update(self, where, what, old_data_version=None, changed_by=None):
        # Because we're a base for History tables as well as normal tables
        # these must be optional parameters, but enforced when the features
        # are enabled.
        if self.history and not changed_by:
            raise ValueError("changed_by must be passed for Tables that have history")
        if self.versioned and not old_data_version:
            raise ValueError("update: old_data_version must be passed for Tables that are versioned")

        # Enforce the data_version check at the query level to eliminate
        # the possibility of a race condition between the time we can
        # retrieve the current data_version, and when we can update the row.
        if self.versioned:
            where = where + [self.table.c.data_version == old_data_version]

        # engine.begin() yields a connection that commits on success and
        # rolls back (history row included) if an exception is raised.
        with self.engine.begin() as transaction:
            # select() is a helper (not shown) that returns the row as a dict.
            row = self.select(where=where, transaction=transaction)
            values = dict(what)
            if self.versioned:
                row["data_version"] += 1
                values["data_version"] = row["data_version"]
            for col in what:
                row[col] = what[col]

            query = self.table.update().values(**values)
            for cond in where:
                query = query.where(cond)
            ret = transaction.execute(query)
            if self.history:
                transaction.execute(self.history.forUpdate(row, changed_by))
            if ret.rowcount != 1:
                raise OutdatedDataError("Failed to update row, old_data_version doesn't match data_version")

class History(AUSTable):
    def __init__(self, metadata, baseTable):
        self.baseTable = baseTable
        self.table = Table("%s_history" % baseTable.table.name, metadata,
            Column("change_id", Integer(), primary_key=True, autoincrement=True),
            Column("changed_by", String(100), nullable=False),
            Column("timestamp", BigInteger(), nullable=False),
        )

        self.base_primary_key = [pk.name for pk in baseTable.primary_key]
        # In addition to the above columns, we need a copy of each Column
        # from our base table.
        for col in baseTable.table.get_children():
            newcol = col.copy()
            # We have our own primary_key Column, and don't want our
            # base table's PK to be part of it.
            if col.primary_key:
                newcol.primary_key = False
            # And while the base table's primary key is always required for
            # history rows, all other columns (including those that are
            # required in the base table) must be nullable.
            else:
                newcol.nullable = True
            self.table.append_column(newcol)

        AUSTable.__init__(self, baseTable.engine, history=False, versioned=False)

    def forUpdate(self, rowData, changed_by):
        row = {}
        # Copy in the data that's about to be updated in the base table...
        for k in rowData:
            row[k] = rowData[k]
        # ...and add the extra data that we need to track history accurately.
        row["changed_by"] = changed_by
        row["timestamp"] = time.time()
        return self.table.insert().values(**row)

A key thing to notice here is that the History tables are maintained automatically with only a minor tweak to the query interface (addition of "changed_by"). And while not shown here, it's important to note that the History table objects are not queryable directly through any exposed API. Even if an attacker got access to Balrog's REST API with admin permissions, they cannot delete rows from those tables.
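To see the all-or-nothing behaviour in isolation, here is a small standalone sketch using sqlite3 instead of SQLAlchemy. The schema and the `insert_release` helper are invented for the example and are much simpler than Balrog's real tables.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE releases (
        name TEXT PRIMARY KEY, data TEXT NOT NULL, data_version INTEGER NOT NULL
    );
    CREATE TABLE releases_history (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        changed_by TEXT NOT NULL, timestamp INTEGER NOT NULL,
        name TEXT NOT NULL, data TEXT, data_version INTEGER
    );
""")


def insert_release(conn, name, data, changed_by):
    """Insert a release and its history row as one all-or-nothing unit."""
    with conn:  # one transaction: both rows are written, or neither is
        conn.execute("INSERT INTO releases VALUES (?, ?, 1)", (name, data))
        conn.execute(
            "INSERT INTO releases_history "
            "(changed_by, timestamp, name, data, data_version) "
            "VALUES (?, ?, ?, ?, 1)",
            (changed_by, int(time.time()), name, data),
        )


insert_release(conn, "Firefox-41.0", "{}", changed_by="alice")
try:
    # No changed_by: the history insert violates NOT NULL, so the base
    # table insert that preceded it is rolled back as well.
    insert_release(conn, "Firefox-42.0", "{}", changed_by=None)
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT name FROM releases").fetchall())
print(conn.execute("SELECT changed_by, name FROM releases_history").fetchall())
```

After the failed second call, neither table contains Firefox-42.0, which is the property that keeps the base table and its history from drifting apart.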

If you'd like to see the complete implementation of either of these, you can find it over in the Balrog repository.


These things were implemented a few years ago, and since then we've discovered a couple of rough edges that would be nice to fix.

The biggest complaint is that the History tables are extremely inefficient. Many of our Release objects are a few hundred kilobytes, which means every change to them (thousands per day) significantly grows the releases_history table. We've dealt with this by limiting how long we keep history for certain types of releases, but it's far less than ideal. We'd love to have a more efficient way of storing history. We've discussed storing history as diffs rather than full copies or compressing the data before inserting the rows, but haven't settled on anything yet. If you have any ideas about this we'd love to hear them!
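For a feel of the two ideas mentioned above, here is a rough sketch of each: compressing a serialized release before storing it, and recording only the top-level keys that changed. Neither is something Balrog does today as far as this post says; the helper names and the sample release blob are made up.

```python
import json
import zlib


def compress_blob(data):
    """Serialize a release dict to JSON and compress it before storage."""
    return zlib.compress(json.dumps(data, sort_keys=True).encode("utf-8"))


def decompress_blob(blob):
    """Inverse of compress_blob()."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def shallow_diff(old, new):
    """Record only the top-level keys that changed between two versions."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    return {"changed": changed, "removed": removed}


def apply_diff(old, diff):
    """Rebuild the new version from the old one plus a shallow_diff()."""
    result = {k: v for k, v in old.items() if k not in diff["removed"]}
    result.update(diff["changed"])
    return result


# A made-up release blob with many similar per-locale entries.
release = {
    "name": "Firefox-41.0",
    "locales": {f"loc{i}": {"buildID": "20150917150946"} for i in range(200)},
}
raw = json.dumps(release, sort_keys=True).encode("utf-8")
print(len(raw), "bytes raw,", len(compress_blob(release)), "bytes compressed")
```

The trade-off is the usual one: diffs are small but reading an old version means replaying them, while compressed full copies stay independently readable at the cost of more space.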

I mentioned earlier how annoying it is when Bugzilla throws you a mid-air collision, and it's no different in Balrog. We get hundreds of them a day when the l10n repacks for different locales all try to update the same Releases. These can be dealt with by retrying, but it's very inefficient. We might be able to do better here if we inspected the details of changes that collide, and only reject them if they try to modify the same parts of an object.
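One way to picture "only reject if they modify the same parts" is a three-way merge over the top-level keys of an object. This is a sketch of the general technique under that assumption, not a design anyone has committed to; `merge_changes` and `MergeConflict` are hypothetical names.

```python
class MergeConflict(Exception):
    pass


_MISSING = object()  # marks "key not present", distinct from a None value


def merge_changes(base, current, incoming):
    """Three-way merge of top-level keys.

    base:     the version the client originally read
    current:  what is in the database now (someone else changed it)
    incoming: the client's modified copy of base
    """
    merged = dict(current)
    for key in set(base) | set(current) | set(incoming):
        base_v = base.get(key, _MISSING)
        cur_v = current.get(key, _MISSING)
        inc_v = incoming.get(key, _MISSING)
        if inc_v == base_v:
            continue  # the client did not touch this key
        if cur_v != base_v and cur_v != inc_v:
            raise MergeConflict(f"both changes modified {key!r}")
        if inc_v is _MISSING:
            merged.pop(key, None)  # the client removed the key
        else:
            merged[key] = inc_v
    return merged


# Two repacks start from the same version and touch different locales.
base = {"en-US": 1, "de": 1}
current = {"en-US": 2, "de": 1}    # another repack already updated en-US
incoming = {"en-US": 1, "de": 2}   # this repack updated de
print(merge_changes(base, current, incoming))
```

Changes to different locales merge cleanly, while two changes to the same locale still raise an error and fall back to the retry path.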

Finally, all of this awesome collision detection and history code is in no way tied to Balrog - the classes that implement it are already very generic. I would love to pull out these features and ship them as their own module, which Balrog (and hopefully others!) can then consume.

Configuring uWSGI to host an app and static files

This week I started using uWSGI for the first time. I'm in the process of switching Balrog from Vagrant to Docker, and I'm moving away from Apache in the process. Because of Balrog's somewhat complicated Apache config this ended up being more difficult than I thought. Although uWSGI's docs are OK, I found it a little difficult to put them into practice without examples, so here's hoping this post will help others in similar situations.

Balrog's Admin app consists of a pretty standard Python WSGI app, and a static Angular app hosted on the same domain. To complicate matters, the version of Angular that we use does not support being hosted anywhere except the root of the domain. It took a bit of futzing, but we came up with an Apache config to host both of these pieces on the same domain pretty quickly:

<VirtualHost *:80>
    ServerName balrog-admin.mozilla.dev
    DocumentRoot /home/vagrant/project/ui/dist/

    # Rewrite virtual paths in the angular app to the index page
    # so that refreshes/linking works, while leaving real files
    # such as the js/css alone.
    <Directory /home/vagrant/project/ui/dist>
        RewriteEngine On
        RewriteCond %{REQUEST_FILENAME} -f [OR]
        RewriteCond %{REQUEST_FILENAME} -d

        RewriteRule ^ - [L]
        RewriteRule ^ index.html [L]
    </Directory>

    # The WSGI app is rooted at /api
    WSGIScriptAlias /api /home/vagrant/project/admin.wsgi
    WSGIDaemonProcess aus4-admin processes=1 threads=1 maximum-requests=50 display-name=aus4-admin
    WSGIProcessGroup aus4-admin
    WSGIPassAuthorization On

    # The WSGI app relies on the web server to do the authentication, and will
    # bail if REMOTE_USER isn't set. To simplify things, we just set this
    # variable instead of prompting for auth.
    SetEnv REMOTE_USER balrogadmin

    LogLevel Debug
    ErrorLog "|/usr/sbin/rotatelogs /var/log/httpd/balrog-admin.mozilla.dev/error_log_%Y-%m-%d 86400 -0"
    CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/balrog-admin.mozilla.dev/access_%Y-%m-%d 86400 -0" combined
</VirtualHost>

Translating this to uWSGI took way longer than expected. Among the problems I ran into were:

  • Using --env instead of --route's addvar action to set REMOTE_USER (--env turns out to be for passing variables to the overall WSGI app).
  • Forgetting to escape "$" when passing routes on the command line, which caused my shell to interpret variables intended for uWSGI.
  • Trying to rewrite URLs to a static path, which I only discovered is invalid after stumbling on an old mailing list thread.
  • Examples from uWSGI's own documentation did not work! I discovered that depending on how it was compiled, you may need to pass "--plugin python,http" to give all of the necessary command line options for what I was doing.

After much struggle, I came up with an invocation that worked exactly the same as the Apache config:

uwsgi --http :8080 --mount /api=admin.wsgi --manage-script-name --check-static /app/ui/dist --static-index index.html --route "^/.*$ addvar:REMOTE_USER=balrogadmin" --route-if "startswith:\${REQUEST_URI};/api continue:" --route-if-not "exists:/app/ui/dist\${PATH_INFO} static:/app/ui/dist/index.html"

There's a lot crammed in there, so let's break it down:

  • --http :8080 tells uWSGI to listen on port 8080
  • --mount /api=admin.wsgi roots the "admin.wsgi" app in /api. This means that when you make a request to http://localhost:8080/api/foo, the application sees "/foo" as the path. If there was no Angular app, I would simply use "--wsgi-file admin.wsgi" to place the app at the root of the server.
  • --manage-script-name causes uWSGI to rewrite PATH_INFO and SCRIPT_NAME according to the mount point. This isn't necessary if you're not using "--mount".
  • --check-static /app/ui/dist points uWSGI at a directory of static files that it should serve. In my case, I've pointed it at the fully built Angular app. With this, requests such as http://localhost:8080/js/app.js return the static file from /app/ui/dist/js/app.js.
  • --static-index index.html tells uWSGI to serve index.html when a request for a directory is made - the default is to 404, because there's no built-in directory indexing.
  • The --route's chain together, and are evaluated as follows:
      • If the requested path matches ^/.*$ (all paths will), set the REMOTE_USER variable to balrogadmin.
      • If the REQUEST_URI starts with /api do not process any more --route's; just satisfy the request. All requests intended for the WSGI app will end up matching here. REQUEST_URI is used instead of PATH_INFO because the latter is rewritten by --manage-script-name.
      • If the requested file does not exist in /app/ui/dist, serve /app/ui/dist/index.html instead. PATH_INFO and REQUEST_URI will still point at the original file, which lets Angular interpret the virtual path and serve the correct thing.
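If it helps to see the dispatch order written out, the combined effect of those options can be modelled in a few lines of Python. This is only a model of the outcome described above, not how uWSGI works internally; the `route` function and its return values are invented for the illustration, and the REMOTE_USER step is left out.

```python
import os
import posixpath


def route(request_uri, static_root, isfile=os.path.isfile, isdir=os.path.isdir):
    """Return a (handler, target) pair describing how a request is served."""
    path = request_uri.split("?", 1)[0]
    if request_uri.startswith("/api"):
        # --mount /api with --manage-script-name: the app sees the path
        # with the /api prefix removed.
        return ("wsgi", path[len("/api"):])
    candidate = posixpath.join(static_root, path.lstrip("/")).rstrip("/")
    if isdir(candidate):
        # --static-index index.html
        return ("static", posixpath.join(candidate, "index.html"))
    if isfile(candidate):
        # --check-static /app/ui/dist
        return ("static", candidate)
    # Virtual Angular path: fall back to the app's index page.
    return ("static", posixpath.join(static_root, "index.html"))


# Try it against a fake file listing instead of a real directory.
files = {"/app/ui/dist/index.html", "/app/ui/dist/js/app.js"}
dirs = {"/app/ui/dist", "/app/ui/dist/js"}
for uri in ("/api/foo", "/js/app.js", "/", "/rules/42"):
    print(uri, "->", route(uri, "/app/ui/dist", files.__contains__, dirs.__contains__))
```

The `isfile` and `isdir` parameters exist only so the example can run without the built Angular app on disk.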

In the end, uWSGI seems to be one of those things that's very scary when you first approach it (I count about 750 command line arguments), but is pretty easy to understand once you get to know it a bit better. This is almost the opposite of Apache - I find Apache much more approachable, perhaps because there's such a litany of examples out there, but things like mod_rewrite are very difficult for me to understand after the fact, at least compared to uWSGI's --route's.

Improvements to updates for Foxfooders

We've been providing on-device updates (that is to say: no flashing required) to users in the Foxfood program for nearly 6 months now. These updates are intended for users who are officially part of the Foxfooding program, but the way our update system works means that anyone who puts themselves on the right update channel can receive them. This makes things tough for us, because we'd like to be able to provide official Foxfooders with some extra bits and we can't do that while these populations are on the same update channel. Thanks to work that Rob Wood and Alexandre Lissy are doing, we'll soon be able to resolve this and get Foxfooders the bits they need to do the best possible testing.

To make this possible, we've implemented a short term solution that lets us serve updates only to official Foxfooders. Once it lands, their devices will send a hashed version of their IMEI as part of each update request. A list of the acceptable IMEI hashes will be maintained in Balrog (the update server), which lets us serve an update only if the incoming hash matches one of the whitelisted ones.
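In outline, the server-side check is a set membership test on the hash. The sketch below assumes SHA-512 purely for illustration (the post doesn't say which hash is used), and the function names and IMEI values are made up.

```python
import hashlib


def hash_imei(imei):
    """Hash an IMEI before sending it. SHA-512 is an assumption here."""
    return hashlib.sha512(imei.encode("ascii")).hexdigest()


def should_serve_update(incoming_hash, whitelist):
    """Serve the update only if the incoming hash is on the whitelist."""
    return incoming_hash in whitelist


# Made-up IMEIs, for illustration only.
whitelist = {hash_imei("490154203237518")}
print(should_serve_update(hash_imei("490154203237518"), whitelist))
print(should_serve_update(hash_imei("000000000000000"), whitelist))
```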

To really make this work we need to detangle the current "dogfood" update channel. As I mentioned, it's currently being used by two distinct populations of users: those who are part of the official program, and those who aren't. In order to support both populations of users we'll be splitting the "dogfood" update channel into two:

  1. The new "foxfood" channel will be for users who are officially part of the Foxfooding program. Users on this channel will be part of the IMEI whitelist, and could receive FOTA or OTA updates.
  2. The "dogfood" channel will continue to serve OTA updates to anyone who puts themselves on it.

To transition, we will be asking folks who are officially part of the Foxfooding program to flash with a new image that switches them to the "foxfood" update channel. When this is ready to go, it will be announced and communicated appropriately.

Big thanks to everyone who was involved in this effort, particularly Rob Wood, who implemented the new whitelisting feature in Balrog, and Alexandre Lissy and Jean Gong, who went through multiple rounds of back and forth before we settled on this solution.

It's worth noting that this solution isn't ideal: sending IMEIs (even hashed versions) isn't something we prefer to do, for reasons of both user privacy and protection of the bits. In the longer term, we'd like to look at a solution that wouldn't require IMEIs to be sent to us. This could come in the form of embedding or asking for credentials, and using those to access the updates. This type of solution would enhance user privacy and make it harder to get around the protections by brute forcing.

Going Faster with Balrog

Go Faster is a broad initiative at Mozilla that is focused on shipping things to users much faster than the current 6 week cycle. One important part of this project is having a mechanism to make Firefox aware of updates it needs or may want to download. This is nothing entirely new of course - we've been shipping updates to users since Firefox 1.5 - but with Go Faster we will be updating bits and pieces of Firefox at a time rather than always updating the entire install. In this post I'm going to outline these new types of updates that we've identified, and talk about how things will work in the Glorious Future.

A Primer on Updates

Firefox updates work on a "pull" system, meaning Firefox regularly queries the update server (Balrog) to ask if there's an update available. For example, my Firefox is currently polling for updates by making a GET request to this URL:


All of the information in that URL is mapped against a set of rules in Balrog, and eventually points to a single release. If that release is newer than the incoming one (based on the incoming version and buildid), Balrog returns the information necessary for the client to update to it:

<updates>
    <update type="minor" displayVersion="41.0" appVersion="41.0" platformVersion="41.0" buildID="20150917150946" detailsURL="https://www.mozilla.org/en-GB/firefox/41.0/releasenotes/">
        <patch type="complete" URL="http://download.mozilla.org/?product=firefox-41.0build3-complete&amp;os=linux64&amp;lang=en-GB" hashFunction="sha512" hashValue="ea0757069363287f67659d8b7d42e0ac6c74a12ce8bd3c7d3e9ad018d03cd6f4640529c270ed2b3f3e75b11320e3a282ad9476bd93b0f501a22d1d9cb8884961" size="48982398"/>
    </update>
</updates>
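For anyone who wants to poke at a response like this, it can be read with Python's standard XML parser. The sketch below uses a trimmed-down copy of the response (the hash is shortened) and a hypothetical `parse_update` helper; it says nothing about how Firefox itself processes the response.

```python
import xml.etree.ElementTree as ET

# A trimmed-down response in the same shape (hash shortened for brevity).
RESPONSE = """<?xml version="1.0"?>
<updates>
    <update type="minor" appVersion="41.0" buildID="20150917150946">
        <patch type="complete" URL="http://download.mozilla.org/?product=firefox-41.0build3-complete&amp;os=linux64&amp;lang=en-GB" hashFunction="sha512" hashValue="ea0757" size="48982398"/>
    </update>
</updates>"""


def parse_update(xml_text):
    """Return (appVersion, buildID, patches), or None if no update is offered."""
    root = ET.fromstring(xml_text)
    update = root.find("update")
    if update is None:
        return None
    patches = [dict(p.attrib) for p in update.findall("patch")]
    return update.get("appVersion"), update.get("buildID"), patches


version, build_id, patches = parse_update(RESPONSE)
print(version, build_id, patches[0]["URL"])
```

Note that the response carries only the download URL, hash, and size; the payload itself is fetched separately.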

It's important to note that Balrog only contains metadata about the update. The actual payloads of the updates are hosted on CDNs.

New Types of Updates

We've identified three different new types of updates that we'll be implementing as part of Go Faster. They are:

  • System Addons: These are core (aka required) parts of Firefox that happen to be implemented as Addons.
  • Security Policy: This is a medium sized piece of JSON that instructs NSS about special security policies to enforce for various websites.
  • Optional Features: These are optional parts of Firefox that may be implemented as Addons or other means.

Each one of these will be implemented as an additional update request to Balrog (we may collapse these into a single request later). E.g., Firefox will look for new System Addons by making a GET request to a URL such as:


The responses will vary a bit depending on the type of update. More on that below.

System Addons

Seeing as Firefox can't function without them, System Addons may seem like a contradiction at first. The advantages are quite clear though: with them, we can ship updates to self contained pieces of Firefox at a substantially faster rate. Shipping an update to all of Firefox takes nearly 24 hours (when we're moving as fast as we can); shipping an update to a System Addon could take as little as minutes.

Although they are implemented as Addons, we can't simply ship them through AMO. Because Firefox cannot function without them we must ship them in the installers and full updates that happen every 6 weeks. This has the nice side effect of minimizing dependency problems -- we won't run into a case where Firefox updates but System Addons don't, which could cause incompatibilities. In between the 6 week cycles Firefox will poll Balrog for updates to System Addons and apply them as they become available. This graph may show this more clearly:

As you can see, Firefox 50.0 can be assumed to have any of Fizz 1.0, 1.1, 1.2, 1.3, or 2.0, while Firefox 51.0 is known to only have Fizz 2.0 (but may receive newer versions later).

When Firefox pings Balrog for System Addon updates, the response will look something like this:

<updates>
    <addons>
        <addon id="fizz@mozilla.org" URL="http://download.cdn.mozilla.net/fizz-1.1.xpi" hashFunction="sha512" hashValue="abcdef123456" version="1.1"/>
        <addon id="pop@mozilla.org" URL="http://download.cdn.mozilla.net/pop-2.5.xpi" hashFunction="sha512" hashValue="abcdef123456" version="2.5"/>
        <addon id="bam@mozilla.org" URL="http://download.cdn.mozilla.net/bam-3.4.xpi" hashFunction="sha512" hashValue="abcdef123456" version="3.4"/>
    </addons>
</updates>

Firefox will compare the list against its currently installed versions and update anything that's out of date. The exact details on where System Addons will live on disk are still being ironed out.
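The comparison step might look roughly like the sketch below. It assumes the addon elements sit inside wrapper elements, and it compares versions as simple dotted integers, which is a simplification of what Firefox really does; `version_key` and `addons_to_update` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Wrapper element names are an assumption; only the <addon> elements matter here.
RESPONSE = """<updates>
    <addons>
        <addon id="fizz@mozilla.org" URL="http://download.cdn.mozilla.net/fizz-1.1.xpi" hashFunction="sha512" hashValue="abcdef123456" version="1.1"/>
        <addon id="pop@mozilla.org" URL="http://download.cdn.mozilla.net/pop-2.5.xpi" hashFunction="sha512" hashValue="abcdef123456" version="2.5"/>
    </addons>
</updates>"""


def version_key(version):
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def addons_to_update(xml_text, installed):
    """Return {addon id: download URL} for add-ons that are missing or outdated."""
    outdated = {}
    for addon in ET.fromstring(xml_text).iter("addon"):
        have = installed.get(addon.get("id"))
        if have is None or version_key(have) < version_key(addon.get("version")):
            outdated[addon.get("id")] = addon.get("URL")
    return outdated


installed = {"fizz@mozilla.org": "1.0", "pop@mozilla.org": "2.5"}
print(addons_to_update(RESPONSE, installed))
```

In this run only Fizz would be downloaded, because the installed copy of Pop already matches the version Balrog is offering.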

Security Policy

Every version of Firefox ships with the most up-to-date set of security policies that were available when it was built. However, these policies are updated more frequently than we ship, and it's crucial that we keep them up to date to keep our users safe. As with System Addons, Firefox will regularly ping Balrog to check for updated security policies. When one is found, Firefox downloads it from Kinto, which will serve it an incremental update to its security policies. The details of this process have been outlined in much more detail by the Cloud Services team.

The Balrog response for these updates is extremely simple: it contains only a version that Firefox passes along to Kinto:

        <setting id="security" lastModified="129386427328"/>

Optional Features

These are parts of Firefox that are not core to the browser, but may be useful to subsets of users. For example: We currently ship a ton of hyphenation dictionaries as part of Firefox for Android. These are locale-specific, so only one ever gets used for each user. We can also distribute opt-in features that not everyone wants or needs, eg: Developer Tools may be a good candidate (there are no plans to do so at this time though).

Optional features may also be implemented in various ways. Hyphenation dictionaries are simple zip files, while something like Developer Tools would likely be an Addon. They will not be included in Firefox installers or update packages. Instead Firefox will regularly query Balrog to see what packages may be available to it. Some things may automatically install based on the user's environment (eg: hyphenation dictionaries for your locale), while other things may require opt-in (eg: optional features).

Balrog responses are not yet set in stone for these, but Kinto is likely to be involved, so the response may end up being similar to the one above for Security Policy updates.


While System Addons, Security Policy, and Optional Features overlap in some areas, each has its own unique combination of requirements. The chart below summarizes these:

Feature             Required?   Shipped in Installer?   Payload Type   Payload Location
System Addons       Yes         Yes                     Addons         CDN
Security Policy     Yes         Yes                     JSON           Kinto
Optional Features   No          No                      Anything       Kinto

Mozilla Software Release GPG Key Transition

Late last week we discovered the expiration of the GPG key that we use to sign Firefox, Fennec, and Thunderbird nightly builds and releases. We had been aware that this was coming up, but we unfortunately missed our deadline to renew it. This caused failures in many of our automated nightly builds, so it was quickly noticed and acted upon.

Our new GPG key is as follows, and available on keyservers such as gpg.mozilla.org and pgp.mit.edu:

pub   4096R/0x61B7B526D98F0353 2015-07-17
      Key fingerprint = 14F2 6682 D091 6CDD 81E3  7B6D 61B7 B526 D98F 0353
uid                            Mozilla Software Releases 
sub   4096R/0x1C69C4E55E9905DB 2015-07-17 [expires: 2017-07-16]

The new primary key is signed by many Mozillians, by the old master key, and by our OpSec team's GPG key. Nightlies and releases will now be signed with the subkey (0x1C69C4E55E9905DB), and a new subkey will be generated from the same primary key before this one expires. This means that you can validate Firefox releases with the primary public key in perpetuity.
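If you fetch the key from a keyserver, it is worth checking that the fingerprint your keyring reports matches the one published above. The sketch below is one way to do that comparison in Python. The helper names are made up for illustration, and the `long_key_id` helper relies on the OpenPGP rule that a version 4 key's long key ID is the low 64 bits of its fingerprint.

```python
# Values copied from the key listing above.
PUBLISHED_FINGERPRINT = "14F2 6682 D091 6CDD 81E3  7B6D 61B7 B526 D98F 0353"
PRIMARY_KEY_ID = "0x61B7B526D98F0353"


def normalize(fingerprint):
    """Strip all whitespace and upper-case the hex digits."""
    return "".join(fingerprint.split()).upper()


def fingerprint_matches(candidate, published=PUBLISHED_FINGERPRINT):
    """Compare two fingerprints regardless of spacing or letter case."""
    return normalize(candidate) == normalize(published)


def long_key_id(fingerprint):
    """Derive the long key ID: the last 16 hex digits of a v4 fingerprint."""
    return "0x" + normalize(fingerprint)[-16:]
```

For example, `long_key_id(PUBLISHED_FINGERPRINT)` gives back the primary key ID from the listing, and `fingerprint_matches` accepts the same fingerprint typed without spaces or in lower case.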

We are investigating a few options to make sure key renewal happens without delay in the future.

Mozilla will stop producing automated builds of XULRunner after the 41.0 cycle

XULRunner is a runtime package that can be used to run XUL+XPCOM based applications. Automated builds of it have been produced alongside Firefox since 2006, but it has not been a supported or resourced product for many years. We've continued to produce automated builds of it because its build process also happens to build the Gecko SDK, which we do support and maintain. This will change soon, and we'll start building the Gecko SDK from Firefox instead (bug 672509). This work will land on mozilla-central during the 42.0 cycle, which means that when the 41.0 cycle ends (September 22, 2015), automated builds of XULRunner will cease.

If you are a consumer of the Gecko SDK this means very little to you -- we will continue to produce it with every Firefox release.

If you are a consumer of the XULRunner stub this means that you will no longer have a Mozilla produced version after 41.0. For folks in this group, you have two options:

  • Change your app to run through the stub provided by Firefox. Many apps will continue to work as before by simply replacing "xulrunner.exe application" with "firefox -app application.ini".
  • Build XULRunner yourself.

Buildbot <-> Taskcluster Bridge Now in Production

A few weeks ago I gave a brief overview of the Buildbot <-> Taskcluster Bridge that we've been developing, and Selena provided some additional details about it yesterday. Today I'm happy to announce that it is ready to take on production work. As more and more jobs from our CI infrastructure move to Taskcluster, the Bridge will coordinate between them and jobs that must remain in Buildbot for the time being.

What's next?

The Bridge itself is feature complete until our requirements change (though there are a couple of minor bugs that would be nice to fix), but most of the Buildbot Schedulers still need to be replaced with Task Graphs. Some of this work will be done at the same time as porting specific build or test jobs to run natively in Taskcluster, but it doesn't have to be. I made a proof of concept on how to integrate selected Buildbot builds into the existing "taskcluster-graph" command and disable the Buildbot schedulers that it replaces. With a bit more work this could be extended to schedule all of the Buildbot builds for a branch, which would make porting specific jobs simpler. If you'd like to help out with this, let me know!

Buildbot Taskcluster Bridge - An Overview

Mozilla has been using Buildbot as its continuous integration system for Firefox and Fennec for many years now. It enabled us to switch from a machine-per-build model to a pool-of-slaves model, and greatly aided us in getting to our current scale. But it's not perfect - and we've known for a few years that we'll need to do an overhaul. Lucky for us, the FirefoxOS Automation team has built up a fantastic piece of infrastructure known as Taskcluster that we're eager to start moving to.

It's not going to be a small task though - it will take a lot more work than taking our existing build scripts and running them in Taskcluster. One reason for this is that many of our jobs trigger other jobs, and Buildbot manages those relationships. This means that if we have a build job that triggers a test job, we can't move one without moving the other. We don't want to be forced into moving entire job chains at once, so we need something to help us transition more slowly. Our solution to this is to make it possible to schedule jobs in Taskcluster while still implementing them in Buildbot. Once the scheduling is in Taskcluster it's possible to move individual jobs to Taskcluster one at a time. The software that makes this possible is the Buildbot Bridge.

The Bridge is responsible for synchronizing job state between Taskcluster and Buildbot. Jobs that are requested through Taskcluster will be created in Buildbot by the Bridge. When those jobs complete, the Bridge will update Taskcluster with their status. Let's look at a simple example to see how the state changes in both systems over the course of a job being submitted and run:

Event                                                                    Taskcluster state   Buildbot state
Task is created                                                          Task is pending     --
Bridge receives "task-pending" event, creates BuildRequest               Task is pending     Build is pending
Build starts in Buildbot                                                 Task is pending     Build is running
Bridge receives "build started" event, claims the Task                   Task is running     Build is running
Build completes successfully                                             Task is running     Build is completed
Bridge receives "build finished" event, reports success to Taskcluster   Task is resolved    Build is completed
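The same sequence can be written down as a tiny state machine. This is a toy model of the table only, not the Bridge's actual code: the class, method names, and state strings are invented for illustration, and it covers just the successful path.

```python
class BridgeSimulation:
    """Toy model of the event table; tracks one job's state in both systems."""

    def __init__(self):
        self.task_state = None   # Taskcluster side
        self.build_state = None  # Buildbot side
        self.history = []

    def _step(self, event, expected, new_task_state, new_build_state):
        # Refuse events that arrive out of order.
        current = (self.task_state, self.build_state)
        if current != expected:
            raise RuntimeError(f"unexpected event {event!r} in state {current}")
        self.task_state, self.build_state = new_task_state, new_build_state
        self.history.append((event, new_task_state, new_build_state))

    def task_created(self):
        self._step("task created", (None, None), "pending", None)

    def bridge_creates_build_request(self):
        # Bridge saw "task-pending" and created a BuildRequest.
        self._step("task-pending", ("pending", None), "pending", "pending")

    def build_starts(self):
        self._step("build starts", ("pending", "pending"), "pending", "running")

    def bridge_claims_task(self):
        # Bridge saw "build started" and claimed the Task.
        self._step("build started", ("pending", "running"), "running", "running")

    def build_completes(self):
        self._step("build completes", ("running", "running"), "running", "completed")

    def bridge_reports_success(self):
        # Bridge saw "build finished" and resolved the Task.
        self._step("build finished", ("running", "completed"), "resolved", "completed")


def run_happy_path():
    """Replay the six rows of the table in order and return the history."""
    sim = BridgeSimulation()
    sim.task_created()
    sim.bridge_creates_build_request()
    sim.build_starts()
    sim.bridge_claims_task()
    sim.build_completes()
    sim.bridge_reports_success()
    return sim.history
```

Note how the Taskcluster state always lags one step behind Buildbot: the build changes state first, and the Bridge then mirrors that change onto the Task.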

The details of how this works are a bit more complicated - if you'd like to learn more about that I recommend watching the presentation I did about the Bridge architecture, or just have a read through my slides