Balrog in 2016 & 2017

This past year has been a big and exciting one for Balrog. We made the transition to Docker, migrated to new infrastructure, participated in Google's Summer of Code for the first time, bootstrapped System Addon updates, and much, much more. In total, 129 tickets were closed, with a significant portion of those being done by volunteers. I'd like to highlight a few things in particular.

Toolchain Improvements

Early on in the year I spent a good deal of time modernizing Balrog's toolchain. I upgraded most of the Python packages, switched to Tox, added Taskcluster for CI, and started using Docker for local development and testing. All of these things combined made it dead-simple to run Balrog on your laptop. This helps with reproducing bugs and testing new work, but the most important aspect ended up being how easy it made it for new developers to get to work. Without this change I wouldn't have felt comfortable proposing Balrog work for Summer of Code, and I highly doubt we would've had many (if any) volunteers working on it.

Looking back on this, I'm now firmly of the belief that if you have some extra time, spending it on things that reduce developer friction is one of the best things you can do. For new projects, make sure you build in time for this at the start.

Volunteers are Awesome

Speaking of volunteers, we've got some awesome ones, and that's pretty much a first for Release Engineering. We've always had a tough time opening up our work to volunteers, primarily because our systems are often difficult to hack on locally. All the work I just talked about to reduce developer friction made this a non-issue for Balrog, and opened the door to start advertising good first bugs, and applying for Summer of Code. I was so amazed and impressed with how many people came knocking when it was made much easier to get started.

But as I look back, I've also realized that there's another part to it: spending the time. When I submitted a Summer of Code proposal, I knew I would have to spend some time mentoring, and it helped put me in the right mindset when volunteers started coming along. While some will just send a PR with a great patch out of the blue, others need a bit more guidance to get started. This doesn't mean that the latter group are less skilled or less valuable, so you shouldn't ignore them. In fact, one of the most active contributors to Balrog is someone that started out in this group. If you can find the time to help these people learn and grow, it can pay dividends down the road.

I want to send out a special thanks to a few of our volunteers in particular. Varun Joshi, our Summer of Code student, who improved the efficiency of our l10n update submissions, and did much of the initial work on System Addons. Njira Perci, who has done the vast majority of the UI work in the past year (I'm pretty sure she's our resident expert on it now!). And Ninad Bhat, who did all of the other work on System Addons, and is now helping out a lot with Multiple Signoff. Without you three I don't know what we would've done!

Transition to Cloud Operations

Early in the year we made the decision to move the production infrastructure of Balrog to CloudOps' platform. While there were a lot of small details to figure out, this went extremely well, and I want to thank the Web Operations, Database Operations, and Cloud Operations teams for making it so smooth. Because it's AWS and Docker based, we now have much more control over the production stack, and are able to scale much more easily as we add more load to Balrog.

System Addon Updates

One of the things I'm most proud of this year is how quickly we were able to spin up updates for System Addons. In the past this may have been a huge headache, but because of Balrog's flexible design we were able to get them working quite quickly, and make some improvements later. Ninad in particular spent considerable time making improvements to that process.

Multifile Updates

Since we began shipping them, Balrog has served updates to Gecko Media Plugins. And since we added a second one, they've been a huge headache for us to manage. After Varun implemented multifile updates, shipping new versions of these plugins became trivial.

Scheduled Changes

Code-wise, this was my biggest accomplishment of the year. The Scheduled Changes work allows us to queue changes to be enacted at a later time and, once we have lower-latency ADI information, when a release hits a target uptake. Once the latter is available, we'll be able to do much better throttled rollouts rather than the guesswork we currently do. This system has also become the basis for the Multiple Signoff work we started in Q4.

2017

This past year has been a great one for Balrog, but I expect 2017 to be even better. Here are some of the things we're looking at for 2017:

  • Finishing up the work on Multiple Signoffs, which will make Balrog much more resilient to bad actors or credential theft.
  • Digging into unifying update requests that Firefox makes. Laura talked a lot about this at Mozloha, and I think we can greatly improve uptake of a lot of different things that Firefox queries for.
  • Authentication improvements. I'd like to look into switching to Okta or another service that allows for MFA.
  • Getting more involved with Cloud Operations QA, who want to help us with better load and contract testing in our deployment pipeline.
  • Various improvements to the Rules.
  • Investigating some architectural changes such as upgrading to Python 3 or adding a service layer.

Here's to another great year of Balrog development! Happy Firefox and Balrog ^.^
(Photo credit: Donna Oberes)

Rings of Power - Multiple Signoff in Balrog

I want to tell you about an important new Balrog feature that we're working on. But I also want to tell you about how we planned it, because I think that part is even more interesting than the project itself.

The Project

Balrog is Mozilla’s update server. It is responsible for deciding which updates to deliver for a given update request. Because updates deliver arbitrary code to users, bad data in the update server could result in orphaning users, or be used as an attack vector to infect them with malware. It is crucial that we make it more difficult for a single user’s account to make changes that affect large populations of users. Not only does this provide some footgun protection, it also safeguards our users from attacks if an account is compromised or an employee goes rogue.

While the current version of Balrog has a notion of permissions, most people effectively have carte-blanche access to one or more products. This means that an under-caffeinated Release Engineer could ship the wrong thing, or a single compromised account could be used to begin an attack. Requiring multiple different accounts to sign off on any sensitive changes will protect us against both of these scenarios.

Multiple sign offs may also be used to enhance Balrog’s ability to support workflows that are more reflective of reality. For example, the Release Management team are the final gatekeepers for most products (ie: we can’t ship without their sign off), but they are usually not the people in the best place to propose changes to Rules. A multiple sign off system that supports different types of roles would allow some people to propose changes and others to sign off on them.

The Planning Process

Earlier this year I blogged about Streamlining the throttled rollout of Firefox releases, which was the largest Balrog project to date at the time. While we did some up-front planning for it, it took significantly longer to implement than I'd originally hoped. This isn't uncommon for software projects, but I was still very disappointed with the slow pace. One of the biggest reasons for this was discovering new edge cases or implementation difficulties after we were deep into coding. Often this would result in needing to rework code that was thought to be finished already, or require new non-trivial enhancements to be made. For Multiple Signoff, I wanted to do better. Instead of a few hours of brainstorming, we've taken a more formal approach with it, and I'd like to share both the process, and the plan we've come up with.

Setting Requirements

I really enjoy writing code. I find it intellectually challenging and fun. This quality is usually very useful, but I think it can be a hindrance in the early stages of large projects, as I tend to jump straight to thinking about implementation before even knowing the full set of requirements. Recognizing this, I made a conscious effort to purge implementation-related thoughts until we had a full set of requirements for Multiple Signoff reviewed by all stakeholders. Forcing myself not to go down the (more fun) path of thinking about code made me spend more time thinking about what we want and why we want it. All of this, particularly the early involvement of stakeholders, uncovered many hidden requirements and things that not everyone agreed on. I believe that identifying them at such an early stage made them much easier to resolve, largely because there was no sunk cost to consider.

Planning the Implementation

Once our full set of requirements was written, I was amazed at how obvious much of the implementation was. New objects and pieces of data stood out like neon signs, and I simply plucked them out of the requirements section. Most of the interactions between them came very naturally as well. I wrote some use cases that acted almost as unit tests for the implementation proposal, and they identified a lot of edge cases and bugs in its first pass. In retrospect, I probably should've written the use cases at the same time as the requirements. Between all of that and another round of review from stakeholders, I have significantly more confidence that the proposed implementation will look like the actual implementation than I have had with any other project of similar size.

Bugs and Dependencies

Just like the implementation flowed easily from the requirements, the bugs and dependencies between them were easy to find by rereading the implementation proposal. In the end, I identified 18 distinct pieces of work, and filed them all as separate bugs. Because the dependencies were easy to identify, I was able to convince Bugzilla to give me a decent graph that helps identify the critical path, and which bugs are ready to be worked on.

Multiple Signoff Bug Graph

But will it even help?

Overall, we probably spent a couple of people-weeks of active time on the requirements and implementation proposal. This isn't an overwhelming amount of time up front, but it's enough that it's important to know whether it was worthwhile before doing it again. This is a question that can only be answered in retrospect. If the work goes faster and the implementation has less churn, I think it's safe to say that it was time well spent. Those are both things that are relatively easy to measure, so I hope to be able to judge this fairly objectively in the end.

The Plan

If you're interested in reading the full set of requirements, implementation plan, and use cases, I've published them here. A HUGE thanks goes out to Ritu, Nick, Varun, Hal, Johan, Justin, Julien, Mark, and everyone else who contributed to it.

What's New with Balrog - October 25th, 2016

The past month has seen some significant and exciting improvements to Balrog. We've had the usual flow of feature work, but also a lot of improvements to the infrastructure and Docker image structure. Let's have a look at all the great work that's been done!

Core Features

Most recently, two big changes have landed that allow us to use multifile updates for SystemAddons. This type of update configuration lets us simplify the configuration of Rules and Releases, which is one of the main design goals of Balrog. Part of this project involved implementing "fallback" Releases - which are used if an incoming request fails a throttle dice roll. This benefits Firefox updates as well, because it will allow us to continue serving updates to older users while we're in the middle of a throttled rollout.
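
To illustrate how a fallback Release fits into the throttle dice roll, here's a tiny sketch of the logic. The field names (backgroundRate, mapping, fallbackMapping) roughly follow the shape of Balrog's Rules, but treat this as an illustration rather than the real implementation.

# Illustrative only -- not Balrog's actual rule evaluation code. The field
# names roughly mirror Balrog's Rules, but treat them as assumptions.
import random

def choose_release(rule):
    # A rule with backgroundRate=25 offers its mapped Release to roughly
    # 25% of matching requests.
    if random.randint(1, 100) <= rule["backgroundRate"]:
        return rule["mapping"]
    # Requests that fail the dice roll get the fallback Release (if any),
    # so older users keep receiving the previous Release during a throttled
    # rollout instead of getting no update at all.
    return rule.get("fallbackMapping")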

Some house cleaning work has been done to remove attributes that Firefox no longer supports, and add support for the new "backgroundInterval" attribute. The latter will give us server-side control of how fast Firefox downloads updates, which we hope will help speed up uptake of new Releases.

There's been some significant refactoring of our Domain Whitelist system. This doesn't change the way it works at all, but cleaning up the implementation has paved the way for some future enhancements.

General Improvements

E-mails are now sent to a mailing list for some types of changes to Balrog's database. This serves as an alert system that ensures unexpected changes don't go unnoticed. In a similar vein, we also started aggregating exceptions to CloudOps' Sentry instance, which has already uncovered numerous production-only errors that had gone unnoticed for months.

Significant improvements have been made to the way our Docker images are structured. Instead of sharing one single Dockerfile for production and local dev, we've split them out. This has allowed the production image to get a lot smaller (mostly thanks to Benson's changes). On the dev side, it has let us improve the local development workflow - all code (including frontend) is now automatically rebuilt and reloaded when changed on the host machine. And thanks to Stefan, we even support development on Windows now!

We now have a script that extracts the "active data" from the production database. When imported into a fresh database (ie: local database, or stage), it will serve exactly the same updates as production without all of the unnecessary history. This should make it much easier to reproduce issues locally, and verify that stage is functioning correctly.
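
The idea behind it (just a conceptual sketch here, not the real script or its interfaces) is to dump the current Rules plus only the Releases they reference, skipping the history tables entirely:

# Conceptual sketch of "active data" extraction -- not the actual script,
# and the db.rules / db.releases interface is invented for illustration.
def extract_active_data(db):
    rules = db.rules.select()
    # Only keep Releases that a Rule currently points at.
    active = set()
    for rule in rules:
        for key in ("mapping", "fallbackMapping"):
            if rule.get(key):
                active.add(rule[key])
    releases = [r for r in db.releases.select() if r["name"] in active]
    # The *_history tables are skipped entirely -- that's where most of the
    # bulk lives.
    return {"rules": rules, "releases": releases}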

Finally, something that's been on my wishlist for a long time finally happened as well: the Balrog Admin API Client code has been moved into the Balrog repo! Because it is so closely linked with the server side API, integrating them makes it much easier to keep them in sync.

The People

The work above was only possible because of all the great contributors to Balrog. A big thanks goes to Ninad, Johan, Varun, Stefan, Simon, Meet, Benson, and Njira for all their time and hard work on Balrog!

What's New with Balrog - September 14th, 2016

The pace of Balrog development has been increasing since the beginning of 2016. We're now at a point where we push new code to production nearly every week, and I'd like to start highlighting all the great work that's being done. The past week was particularly busy, so let's get into it!

The most exciting thing to me is Meet Mangukiya's work on adding support for substituting some runtime information when showing deprecation notices. This is Meet's first contribution to Balrog, and he's done a great job! This work will allow us to send users to better pages when we've deprecated support for their platform or OS.

Njira has continued to chip away at some UI papercuts, fixing some display issues with history pages, addressing some bad word wrapping on some pages, and reworking some dialogs to make better use of space and minimize scrolling.

A few weeks ago Johan suggested that it might be time to get rid of the submodule we use for UI and integrate it with the primary repository. This week he's done so, and it's already improved the workflow for developers.

For my part, I got the final parts of my Scheduled Changes work landed - a project that has been in the works since January. With it, we can pre-schedule changes to Rules, which will help minimize the potential for human error when we ship, and make it unnecessary for RelEng to be around just to hit a button. I also fixed a regression (that I introduced) that made it impossible to grant full admin access, oops!

Scheduled Changes UI

I also want to give a big thank you to Benson Wong for his help and expertise in getting the Balrog Agent deployed - it was a key piece in the Scheduled Changes work, and went pretty smoothly!

Platform Operations Project of the Month: Balrog

Hello from Platform Engineering Ops! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is Balrog!

What is it?

Balrog is the software that runs the server side component of the update system used by Firefox and other Mozilla products. It is the successor to AUS (Application Update Service), which did not scale to our current needs nor allow us to adapt to more recent business requirements. Balrog helps us ship updates faster and with much more flexibility than we've had in the past.

How it works

At the heart of Balrog is its Rules table. The Rules allow us to filter incoming update requests by their product, channel, and other fields, and, using that information, respond with the correct Release. One of the key design goals of Balrog is to make these Rules easy for humans to understand and manipulate, while providing enough flexibility to meet business requirements.

Releases are the other major part of Balrog. These are models that contain metadata (platforms, locales, payload URLs, etc.) about a set of binaries that we ship (eg: Firefox 49.0).
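
As a rough illustration, a Rule and the Release it maps to might look something like this. The field names and structure are simplified and illustrative, not the exact schema:

# Simplified, illustrative data only -- the real schema has more fields
# and versioned blob formats.
rule = {
    "priority": 100,
    "product": "Firefox",
    "channel": "release",
    "mapping": "Firefox-49.0-build4",  # which Release to respond with
    "backgroundRate": 100,             # percentage of requests that get it
}

release = {
    "name": "Firefox-49.0-build4",
    "platforms": {
        "WINNT_x86_64-msvc": {
            "locales": {
                "en-US": {
                    "completes": [{"fileUrl": "https://download.example/Firefox-49.0.complete.mar"}],
                },
            },
        },
    },
}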

You can learn more about Balrog's architecture and how it works on the main wiki page.

Contributing

Earlier this year we modernized Balrog's toolchain, and it's now in a place where anyone with Python or JS experience should be able to contribute successfully. Balrog has excellent unit test coverage, which will help give you confidence in any changes you make.

If you have Docker installed already, you can play with it with a few simple commands:

git clone https://github.com/mozilla/balrog
cd balrog
git submodule init
git submodule update
docker-compose up

Once running, you can access the admin interface at http://localhost:8080, where you should see Balrog's splash page.

Good First Bugs

We do our best to always have good first bugs ready for new contributors. Here are a few of the current ones (you can find the rest on the wiki):

If you have any questions or need help you're welcome to join us in #balrog on irc.mozilla.org - we're happy to help!

Building and Pushing Docker Images with Taskcluster-Github

Earlier this year I spent some time modernizing and improving Balrog's toolchain. One of my goals in doing so was to switch from Travis CI to Taskcluster, both to give us more flexibility in our CI configuration and to help dogfood Taskcluster-Github. One of the most challenging aspects of this was how to build and push our Docker image, and I'm hoping this post will make it easier for other folks who want to do the same in the future.

The Task Definition

Let's start by breaking down the Task definition from Balrog's .taskcluster.yml. Like other Taskcluster-Github jobs, we use the standard taskcluster.docker provisioner and worker.

  - provisionerId: "{{ taskcluster.docker.provisionerId }}"
    workerType: "{{ taskcluster.docker.workerType }}"

Next, we have something a little different. This section grants the Task access to a secret (managed by the Secrets Service). More on this later.

    scopes:
      - secrets:get:repo:github.com/mozilla/balrog:dockerhub

The payload has a few things of note. Because we're going to be building Docker images, it makes sense to use Taskcluster's image_builder Docker image and to enable the docker-in-docker feature. The taskclusterProxy feature is needed to access the Secrets Service.

    payload:
      maxRunTime: 3600
      image: "taskcluster/image_builder:0.1.3"
      features:
        dind: true
        taskclusterProxy: true
      command:
        - "/bin/bash"
        - "-c"
        - "git clone $GITHUB_HEAD_REPO_URL && cd balrog && git checkout $GITHUB_HEAD_BRANCH && scripts/push-dockerimage.sh"

The extra section has some metadata for Taskcluster-Github. Unlike CI tasks, we limit this to only running on pushes (not pull requests) to the master branch of the repository. Because only a few people can push to this branch, only they can trigger Docker builds.

    extra:
      github:
        env: true
        events:
          - push
        branches:
          - master

Finally, we have the metadata, which is just standard Taskcluster stuff.

    metadata:
      name: Balrog Docker Image Creation
      description: Balrog Docker Image Creation
      owner: "{{ event.head.user.email }}"
      source: "{{ event.head.repo.url }}"

Secrets

I mentioned the "Secrets Service" earlier, and it's the key piece that enables us to securely push Docker images. Putting our Dockerhub password in it means access is limited to those who have the right scopes. We store it in a secret with the key "repo:github.com/mozilla/balrog:dockerhub", which means that anything with the "secrets:get:repo:github.com/mozilla/balrog:dockerhub" scope is granted access to it. My own personal Taskcluster account has it, which lets me set or change the password:

We also have a Role called "repo:github.com/mozilla/balrog:branch:master" which has that scope:

You can see from its name that this Role is associated with the Balrog repository's master branch. Because of this, any Tasks created as a result of pushes to that branch of that repository may be granted the scopes that Role has - like we did above in the "scopes" section of the Task.

Building and Pushing

The last piece of the puzzle here is the actual script that does the building and pushing. Let's look at a few specific parts of it.

To start with, we deal with retrieving the Dockerhub password from the Secrets Service. Because we enabled the taskclusterProxy earlier, "taskcluster" resolves to the hosted Taskcluster services. Had we forgotten to grant the Task the necessary scope, this would return a 403 error.

password_url="taskcluster/secrets/v1/secret/repo:github.com/mozilla/balrog:dockerhub"
dockerhub_password=$(curl ${password_url} | python -c 'import json, sys; a = json.load(sys.stdin); print a["secret"]["dockerhub_password"]')

We build, tag, and push the image, which is very similar to building it locally. If we'd forgotten to enable the dind feature, this would throw errors about not being able to run Docker.

docker build -t mozilla/balrog:${branch_tag} .
docker tag mozilla/balrog:${branch_tag} "mozilla/balrog:${date_tag}"
docker login -e $dockerhub_email -u $dockerhub_username -p $dockerhub_password
docker push mozilla/balrog:${branch_tag}
docker push mozilla/balrog:${date_tag}

Finally, we attach an artifact to our Task containing the sha256 of the Docker images. This allows consumers of the Docker image to verify that they're getting exactly what we built, and not something that may have been tampered with on Dockerhub or in transit.

sha256=$(docker images --no-trunc mozilla/balrog | grep "${date_tag}" | awk '/^mozilla/ {print $3}')
put_url=$(curl --retry 5 --retry-delay 5 --data "{\"storageType\": \"s3\", \"contentType\": \"text/plain\", \"expires\": \"${artifact_expiry}\"}" ${artifact_url} | python -c 'import json; import sys; print json.load(sys.stdin)["putUrl"]')
curl --retry 5 --retry-delay 5 -X PUT -H "Content-Type: text/plain" --data "${sha256}" "${put_url}"

The Result

Now that you've seen how it's put together, let's have a look at the end result. This is the most recent Balrog Docker build Task. You can see the sha256 artifact created on it:

And of course, the newly built image has shown up on the Balrog Dockerhub repo:

A Flurry of Balrog Activity

This past quarter I spent some time modernizing Balrog's toolchain to make it more approachable. We've switched from Vagrant to Docker, cleaned up setup.py, started using tox, and updated the sample data included in the repo. At the same time, I started identifying some good first bugs, and put together a proposal for a Summer of Code project.

I feel very lucky that this work has paid off so quickly. There's been great interest in the Summer of Code project, and we've had 5 new volunteers submit patches to Balrog. These people are doing really great work, and I'd like to highlight their contributions today (in no particular order).

Njira Perci

Njira has focused on UI improvements, and has already improved the confirmation dialog for deleting Rules and added the ability to autocomplete Products and Channels in form fields. She continues to hack away and is now working on improving the Releases UI to highlight whether or not a release is active in any way.

Ashish Sareen

Ashish fixed a bug where the Admin server would hit an ISE 500 under certain conditions. With his patch, it now correctly returns a 400 error to the client.

Varun Joshi

Varun has been diving deep into the Admin server. He started off by fixing a small bug where some weirdly formed Releases caused ISE 500s, and has since provided a patch that gives us the ability to mark Releases as "read only". This is something we intend to make use of in our new Release Promotion system to guard against accidental (or malicious...) changes to Release metadata.

Aybüke Özdemir

Aybüke enhanced the UI to show rule_ids, which makes it easier for humans to find them when they need to put them into a script or automation.

Kumar Rishabh

Kumar fixed a very annoying bug where diffs of different versions of Releases would be generated against the wrong base version, making them essentially useless.

You?

If you would like to get involved in the development of Balrog, we'd love to have you. The wiki page can get you bootstrapped, and you can find us all on irc.mozilla.org in #balrog.

Streamlining throttled rollout of Firefox releases

When we ship new versions of Firefox we do our best to avoid introducing new bugs or crashes, especially those that affect large numbers of users. One of the strategies we use to accomplish this is to ship new versions to a subset of people before shipping to everyone. We call this a "throttled rollout", and it's something we've been doing for many years. The tricky part of this is getting the new version to enough users to have a representative sample size without overshooting our target.

Our current process for doing this is as follows:

  • Enable updates to the new version at a rate of 25% (meaning 25% of update requests will be offered the new version)
  • Wait ~24 hours
  • Turn the update rate down to 0%
  • Hope that we hit our target without going over

The rate and time period have been tuned over time, but it's still a very fragile process. Sometimes we get more or fewer update requests than expected during the 24h window. Sometimes we forget to set the rate back down to 0%. A process that's driven manually and dependent on guesswork has a lot of things that can go wrong. We can do better here. What if we could schedule rate changes to avoid forgetting to make them? What if we could monitor real-time uptake information to eliminate the guesswork? Nick and I have come up with a plan that allows Balrog to do these things, and I'm excited to share it.

Enter: Balrog Agent

The Balrog Agent will be a new component of the system that can be configured to enact changes to update rules at specific times or when certain conditions are met. We will allow users to schedule changes through the Admin UI; the Agent will watch for their condition(s) to be hit, and then enact the requested change. For now the only condition we will support is uptake of a specific version on a specific channel, which we will soon be able to get from Telemetry. This diagram shows where the Agent fits into the system:



Once implemented, our new process could look something like:

  • Add scheduled change to enable updates to the new version at a rate of 25%
  • Add scheduled change to turn the update rate down to 0% after we hit our target uptake
  • Let the Balrog Agent do the rest

Unlike the manual changes in our current process, the creation of both scheduled changes is not time sensitive - it can be done at any point prior to release day. This means that humans don't have to be around and/or remember to flip bits at certain times, nor do we have to worry about tweaking the time windows as our uptake rate changes. It Just Works (tm).

As always, security was a concern when designing the Balrog Agent. We don't want it to have root-like access to the Balrog database; we just want it to be able to make the specific changes that users have already set up. To satisfy this requirement, we'll be adding a special endpoint to the Admin API (something like /rules/scheduled_changes) which can only enact changes that users have previously scheduled. When users schedule new changes through the UI, Balrog will ensure that they have permission to make the change they're scheduling. The Agent will use the new endpoint to enact changes, which prevents it from making changes that a user didn't explicitly request. As with other parts of Balrog's database, the history of scheduled changes and when they were enacted will be kept to ensure that they are auditable.
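
To make that flow a little more concrete, here's a rough sketch of what the Agent's main loop might look like. Only the /rules/scheduled_changes endpoint name comes from the design above; the response shape, the telemetry_uptake() helper, and the "enact" URL are invented here for illustration.

# Hypothetical sketch of the Balrog Agent's main loop -- not a real design.
# Only the /rules/scheduled_changes endpoint name comes from the post above;
# the response format, telemetry_uptake(), and the enact URL are assumptions.
import time
import requests

ADMIN_API = "https://balrog-admin.example.com/api"

def telemetry_uptake(product, channel, version):
    # Placeholder: a real implementation would query Telemetry for uptake
    # of `version` on `channel`. Returning 0 means nothing ever gets enacted.
    return 0

def run_agent(poll_interval=300):
    while True:
        resp = requests.get(ADMIN_API + "/rules/scheduled_changes")
        for change in resp.json().get("scheduled_changes", []):
            cond = change["condition"]
            uptake = telemetry_uptake(cond["product"], cond["channel"], cond["version"])
            if uptake >= cond["target_uptake"]:
                # The Agent can only enact changes users already scheduled;
                # the endpoint refuses anything else.
                requests.post(ADMIN_API + "/rules/scheduled_changes/%s/enact" % change["sc_id"])
        time.sleep(poll_interval)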

Because this is the first time we'll have an automated system making changes to update rules at unpredictable times, another concern that came up was making sure that humans are not surprised when it happens. It's going to feel weird at first to have the release channel update rate managed by automation. To minimize the surprise and confusion, we're planning to have the Agent send out e-mail before making changes. This serves as a heads up for us humans and gives us time to react if the Agent is about to make a change that may not be desired anymore.

We know from past experience that it's impossible for us to predict the interesting ways and conditions we'll want to offer updates. One of the things I really like about this design is that the only limit to what we can do is the data that the Agent has available. While it's starting off with uptake data, we can enhance it later to look at Socorro or other key systems. Wouldn't it be pretty cool if we automatically shut off updates if we hit a major crash spike? I think so.

If you're interested in the nitty-gritty details of this project there's a lot more information in the bug. If you're interested in Balrog in general, I encourage you to check out the wiki or come chat with us on IRC.

Collision Detection and History with SQLAlchemy

Balrog is one of the more crucial systems that Release Engineering works on. Many of our automated builds send data to it and all Firefox installations in the wild regularly query it to look for updates. It is an SQLAlchemy based app, but because of its huge importance it became clear in the early stages of development that we had a couple of requirements that went beyond those of most other SQLAlchemy apps, specifically:

  • Collision Detection: Changes to the database must always be done safely. Balrog must not allow one change to silently override another one.
  • Full History: Balrog must provide complete auditability and history for all changes. We must be able to associate every change with an account and a timestamp.

Implementing these two requirements ended up being an interesting project, and I'd like to share the details of how it all works.

Collision Detection

Anyone who's used Bugzilla for a while has probably encountered this screen before:



This screenshot shows how Bugzilla detects and warns if you try to make a change to a bug before loading changes someone else has made. While it's annoying when this screen slows you down, it's important that Bugzilla doesn't let you unknowingly overwrite other folks' changes. This is very similar to what we wanted to do in Balrog, except that we needed to enforce it at the API level, not just in the UI. In fact, we decided it was best to enforce it at the lowest level possible to minimize the chance of needing to duplicate it in different parts of the app.

To do this, we started by creating a thin wrapper around SQLAlchemy which ensures that each table has a "data_version" column, and requires an "old_data_version" to be passed when doing an UPDATE or DELETE. Here's a slimmed down version of how it works with UPDATE:

from sqlalchemy import Column, Integer


class OutdatedDataError(Exception):
    """Raised when old_data_version no longer matches the row's data_version."""


class AUSTable(object):
    """Base class for Balrog tables. Subclasses must create self.table as an
    SQLAlchemy Table object prior to calling AUSTable.__init__()."""
    def __init__(self, engine):
        self.engine = engine
        # Ensure that the table has a data_version column on it.
        self.table.append_column(Column("data_version", Integer, nullable=False))

    def update(self, where, what, old_data_version):
        # Enforce the data_version check at the query level to eliminate
        # the possibility of a race condition between the time we
        # retrieve the current data_version and the time we update the row.
        where.append(self.table.c.data_version == old_data_version)

        # engine.begin() gives us a connection wrapped in a transaction that
        # commits on success and rolls back on error.
        with self.engine.begin() as transaction:
            row = self.select(where=where, transaction=transaction)
            row["data_version"] += 1
            for col in what:
                row[col] = what[col]

            # Write back the full updated row, including the bumped data_version.
            query = self.table.update(values=row)
            for cond in where:
                query = query.where(cond)
            ret = transaction.execute(query)

            if ret.rowcount != 1:
                raise OutdatedDataError("Failed to update row, old_data_version doesn't match data_version")

And one of our concrete tables:

class Releases(AUSTable):
    def __init__(self, engine, metadata):
        self.table = Table("releases", metadata,
            Column("name", String(100), primary_key=True),
            Column("product", String(15), nullable=False),
            Column("data", Text(), nullable=False),
        )
        AUSTable.__init__(self, engine)

    def updateRelease(self, name, old_data_version, product, data):
        what = {
            "product": product,
            "data": data,
        }
        self.update(where=[self.table.c.name == name], what=what, old_data_version=old_data_version)

As you can see, the data_version check is added as a clause to the UPDATE statement - so there's no way we can race with other changes. The usual workflow for callers is to retrieve the current version of the data, modify it, and pass it back along with the old data_version (most of the time retrieval and pass back happens through the REST API). It's worth pointing out that a client could pass a different value as old_data_version in an attempt to thwart the system. This is something we explicitly do not try to protect against (and honestly, I don't think we could) -- data_version is a protection against accidental races, not against malicious changes.
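
For example, a caller built on the simplified classes above might follow that read-modify-write-retry pattern like this. This is a sketch only, not code from Balrog itself, and it assumes select() returns a plain row dict:

# Sketch of the caller workflow described above, using the simplified
# classes from this post. Not actual Balrog code.
def update_release_data(releases, name, new_data, max_retries=3):
    for _ in range(max_retries):
        row = releases.select(where=[releases.table.c.name == name])
        try:
            releases.updateRelease(
                name=name,
                old_data_version=row["data_version"],
                product=row["product"],
                data=new_data,
            )
            return
        except OutdatedDataError:
            # Someone else changed the row between our read and our write;
            # re-read the new data_version and try again.
            continue
    raise OutdatedDataError("gave up after %d attempts" % max_retries)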

Full History

Having history of all changes to Balrog's database is not terribly important on a day-to-day basis, but when we have issues related to updates it's extremely important that we're able to look back in time and see why a particular issue happened, how long it existed for, and who made the change. Like collision detection, this is implemented at a low level of Balrog to make sure it's difficult to bypass when writing new code.

To achieve it we create a History table for each primary data table. For example, we have both "releases" and "releases_history" tables. In addition to all of the Releases columns, the associated History table also has columns for the account name that made each change and a timestamp of when it was made. Whenever an INSERT, UPDATE, or DELETE is executed, the History table has a new row inserted with the full contents of the new version. Both writes are done in a single transaction to make sure it is an all-or-nothing operation.

Building on the code from above, here's a simplified version of how we implement History:

import time

from sqlalchemy import BigInteger, Column, Integer, String, Table


class AUSTable(object):
    """Base class for Balrog tables. Subclasses must create self.table as an
    SQLAlchemy Table object prior to calling AUSTable.__init__()."""
    def __init__(self, engine, history=True, versioned=True):
        self.engine = engine
        self.versioned = versioned
        # Versioned tables (generally, non-History tables) need a data_version.
        if versioned:
            self.table.append_column(Column("data_version", Integer, nullable=False))

        # Well defined interface to the primary_key columns, needed by History tables.
        self.primary_key = []
        for col in self.table.get_children():
            if col.primary_key:
                self.primary_key.append(col)

        if history:
            self.history = History(self.table.metadata, self)
        else:
            self.history = None

    def update(self, where, what, old_data_version=None, changed_by=None):
        # Because we're a base for History tables as well as normal tables
        # these must be optional parameters, but enforced when the features
        # are enabled.
        if self.history and not changed_by:
            raise ValueError("changed_by must be passed for Tables that have history")
        if self.versioned and not old_data_version:
            raise ValueError("update: old_data_version must be passed for Tables that are versioned")

        # Enforce the data_version check at the query level to eliminate
        # the possibility of a race condition between the time we
        # retrieve the current data_version and the time we update the row.
        where.append(self.table.c.data_version == old_data_version)

        with self.engine.begin() as transaction:
            row = self.select(where=where, transaction=transaction)
            row["data_version"] += 1
            for col in what:
                row[col] = what[col]

            # Write back the full updated row, including the bumped data_version.
            query = self.table.update(values=row)
            for cond in where:
                query = query.where(cond)
            ret = transaction.execute(query)
            if self.history:
                transaction.execute(self.history.forUpdate(row, changed_by))
            if ret.rowcount != 1:
                raise OutdatedDataError("Failed to update row, old_data_version doesn't match data_version")


class History(AUSTable):
    def __init__(self, metadata, baseTable):
        self.baseTable = baseTable
        self.table = Table("%s_history" % baseTable.table.name, metadata,
            Column("change_id", Integer(), primary_key=True, autoincrement=True),
            Column("changed_by", String(100), nullable=False),
            Column("timestamp", BigInteger(), nullable=False),
        )

        self.base_primary_key = [pk.name for pk in baseTable.primary_key]
        # In addition to the above columns, we need a copy of each Column
        # from our base table.
        for col in baseTable.table.get_children():
            newcol = col.copy()
            # We have our own primary_key Column, and don't want our
            # base table's PK to be part of it.
            if col.primary_key:
                newcol.primary_key = False
            # And while the base table's primary key is always required for
            # history rows, all other columns (including those that are
            # required in the base table) must be nullable.
            else:
                newcol.nullable = True
            self.table.append_column(newcol)

        AUSTable.__init__(self, baseTable.engine, history=False, versioned=False)

    def forUpdate(self, rowData, changed_by):
        row = {}
        # Copy in the data that's about to be updated in the base table...
        for k in rowData:
            row[k] = rowData[k]
        # ...and add the extra data that we need to track history accurately.
        row["changed_by"] = changed_by
        row["timestamp"] = time.time()
        return self.table.insert(values=row)

A key thing to notice here is that the History tables are maintained automatically with only a minor tweak to the query interface (addition of "changed_by"). And while not shown here, it's important to note that the History table objects are not queryable directly through any exposed API. Even if an attacker got access to Balrog's REST API with admin permissions, they cannot delete rows from those tables.

If you'd like to see the complete implementation of either of these, you can find it over in the Balrog repository.

Enhancements

These things were implemented a few years ago, and since then we've discovered a couple of rough edges that would be nice to fix.

The biggest complaint is that the History tables are extremely inefficient. Many of our Release objects are a few hundred kilobytes, which means every change to them (thousands per day) significantly grows the releases_history table. We've dealt with this by limiting how long we keep history for certain types of releases, but it's far from ideal. We'd love to have a more efficient way of storing history. We've discussed storing history as diffs rather than full copies or compressing the data before inserting the rows, but haven't settled on anything yet. If you have any ideas about this we'd love to hear them!
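
To sketch one of those ideas (nothing like this is implemented, and the helper names are made up), the Release blob could be compressed before being written to releases_history and decompressed when history is read back:

# Sketch only: compress Release blobs before storing them in
# releases_history, and decompress them when reading history back.
import zlib

def compress_blob(data):
    # Release blobs are large, repetitive JSON text, so they compress well.
    return zlib.compress(data.encode("utf-8"), 9)

def decompress_blob(blob):
    return zlib.decompress(blob).decode("utf-8")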

I mentioned earlier how annoying it is when Bugzilla throws you a mid-air collision, and it's no different in Balrog. We get hundreds of them a day when l10n repacks for many locales all try to update the same Releases. These can be dealt with by retrying, but it's very inefficient. We might be able to do better here if we inspected the details of changes that collide, and only rejected them if they try to modify the same parts of an object.

Finally, all of this awesome collision detection and history code is in no way tied to Balrog - the classes that implement it are already very generic. I would love to pull out these features and ship them as their own module, which Balrog (and hopefully others!) can then consume.

Configuring uWSGI to host an app and static files

This week I started using uWSGI for the first time. I'm in the process of switching Balrog from Vagrant to Docker, and I'm moving away from Apache in the process. Because of Balrog's somewhat complicated Apache config this ended up being more difficult than I thought. Although uWSGI's docs are OK, I found it a little difficult to put them into practice without examples, so here's hoping this post will help others in similar situations.

Balrog's Admin app consists of a pretty standard Python WSGI app, and a static Angular app hosted on the same domain. To complicate matters, the version of Angular that we use does not support being hosted anywhere except the root of the domain. It took a bit of futzing, but we came up with an Apache config to host both of these pieces on the same domain pretty quickly:

<VirtualHost *:80>
    ServerName balrog-admin.mozilla.dev
    DocumentRoot /home/vagrant/project/ui/dist/

    # Rewrite virtual paths in the angular app to the index page
    # so that refreshes/linking works, while leaving real files
    # such as the js/css alone.
    <Directory /home/vagrant/project/ui/dist>
        RewriteEngine On
        RewriteCond %{REQUEST_FILENAME} -f [OR]
        RewriteCond %{REQUEST_FILENAME} -d

        RewriteRule ^ - [L]
        RewriteRule ^ index.html [L]
    </Directory>

    # The WSGI app is rooted at /api
    WSGIScriptAlias /api /home/vagrant/project/admin.wsgi
    WSGIDaemonProcess aus4-admin processes=1 threads=1 maximum-requests=50 display-name=aus4-admin
    WSGIProcessGroup aus4-admin
    WSGIPassAuthorization On

    # The WSGI app relies on the web server to do the authentication, and will
    # bail if REMOTE_USER isn't set. To simplify things, we just set this
    # variable instead of prompting for auth.
    SetEnv REMOTE_USER balrogadmin

    LogLevel Debug
    ErrorLog "|/usr/sbin/rotatelogs /var/log/httpd/balrog-admin.mozilla.dev/error_log_%Y-%m-%d 86400 -0"
    CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/balrog-admin.mozilla.dev/access_%Y-%m-%d 86400 -0" combined
</VirtualHost>

Translating this to uWSGI took way longer than expected. Among the problems I ran into were:

  • Using --env instead of --route's addvar action to set REMOTE_USER (--env turns out to be for passing variables to the overall WSGI app).
  • Forgetting to escape "$" when passing routes on the command line, which caused my shell to interpret variables intended for uWSGI.
  • Trying to rewrite URLs to a static path, which I only discovered is invalid after stumbling on an old mailing list thread.
  • Examples from uWSGI's own documentation did not work! I discovered that depending on how it was compiled, you may need to pass "--plugin python,http" to give all of the necessary command line options for what I was doing.

After much struggle, I came up with an invocation that worked exactly the same as the Apache config:

uwsgi --http :8080 --mount /api=admin.wsgi --manage-script-name --check-static /app/ui/dist --static-index index.html --route "^/.*$ addvar:REMOTE_USER=balrogadmin" --route-if "startswith:\${REQUEST_URI};/api continue:" --route-if-not "exists:/app/ui/dist\${PATH_INFO} static:/app/ui/dist/index.html"

There's a lot crammed in there, so let's break it down:

  • --http :8080 tells uWSGI to listen on port 8080
  • --mount /api=admin.wsgi roots the "admin.wsgi" app in /api. This means that when you make a request to http://localhost:8080/api/foo, the application sees "/foo" as the path. If there was no Angular app, I would simply use "--wsgi-file admin.wsgi" to place the app at the root of the server.
  • --manage-script-name causes uWSGI to rewrite PATH_INFO and SCRIPT_NAME according to the mount point. This isn't necessary if you're not using "--mount".
  • --check-static /app/ui/dist points uWSGI at a directory of static files that it should serve. In my case, I've pointed it at the fully built Angular app. With this, a request such as http://localhost:8080/js/app.js returns the static file from /app/ui/dist/js/app.js.
  • --static-index index.html tells uWSGI to serve index.html when a request for a directory is made - the default is to 404, because there's no built-in directory indexing.
  • The --route's chain together, and are evaluated as follows:
  • If the requested path matches ^/.*$ (all paths will), set the REMOTE_USER variable to balrogadmin.
  • If the REQUEST_URI starts with /api, do not process any more --route's; just satisfy the request. All requests intended for the WSGI app will end up matching here. REQUEST_URI is used instead of PATH_INFO because the latter is rewritten by --manage-script-name.
  • If the requested file does not exist in /app/ui/dist, serve /app/ui/dist/index.html instead. PATH_INFO and REQUEST_URI will still point at the original file, which lets Angular interpret the virtual path and serve the correct thing.

In the end, uWSGI seems to be one of those things that's very scary when you first approach it (I count about 750 command line arguments), but is pretty easy to understand once you get to know it a bit better. This is almost the opposite of Apache - I find it much more approachable, perhaps because there's such a litany of examples out there, but things like mod_rewrite are very difficult for me to understand after the fact, at least compared to uWSGI's --route's.