Contribution opportunity: Release Engineering systems

Ben Hearsum

2013-12-02 09:35

Release Engineering runs a vast array of infrastructure and systems that do much of the continuous integration and releases for Mozilla. Many of our systems are small in their scope but must be able to scale up to support the incredible load that developers put on them. Other systems receive millions of requests every day from live Firefox, Fennec, and Thunderbird installations.

Do you want help developer productivity or get releases into users hands more quickly and efficiently? Do you want to gain experience working on systems that must work at scale? If so, Release Engineering is a great place to look. Below are a few interesting bugs that could use some attention. If you're interested in working on any of them I'm interested in mentoring you. You should be familiar with Python, but you don't need to be an expert. Have a look below and contact me directly if anything interests you.

Partial update generation service: Arguably, updates are the most important part of release process. Partial updates in particular help us keep a good user experience by reducing the amount of data a user needs to download, which means they update more quickly. We generate many of these already but creating this service would allow much more flexibility over what and when we generate partial updates. This project would involve writing the service from scratch, most likely in Python.
Update Balrog schema to support multiple partials: Balrog is the code name of our new update server (which I've previously blogged about). It's original design came about before we supported serving partial updates to users on multiple older versions of Firefox. In order to start using Balrog for Betas and Releases we need to add this feature. Balrog is written in Python and this will mostly involve server side changes to it.
Improve update verify output: "Update verify" is a very important test that we run as part of our release automation. It's job is to make sure that all users, regardless of where they're coming from, end up in the same state after updating to the latest release. It's output currently consists of thousands and thousands of lines of text, with test results interspersed. This bug is about finding and implementing a way to make the output easier for a human to make sense of and parse upon failure. The update verify scripts are written in bash, but this could be implemented by modifying them or post-processing the output.
Store history of machine actions requested through API: We recently deployed a new system that helps us manage our thousands of build and test machines. It aims to be a single entry point for information gathering and common operations on them. Currently, the data in it is volatile -- all history of operations is lost when the server is restarted. This bug will involve adding permanent storage (maybe SQL, maybe something else) to that server, which is written in Python.

Comments