May 22 2013
 

One of the most important systems I work on is the release automation for Firefox and Thunderbird. The process behind the automation long predates me, but I’ve been deeply involved in automating, refining, and optimizing it. It shouldn’t come as any surprise that one of the biggest challenges of working on such a complex system is understanding how the smaller pieces fit together to make the whole. For the release automation we have an advantage, though: the smaller pieces are generally Buildbot Builders, and the things that fit them together are generally Buildbot Schedulers. A while ago I was improving parallelism for l10n repacks and found it extremely difficult to reason about whether my changes would actually create the desired Builders and string them together correctly. I threw together some (terrible) code that spat out a digraph of the release automation’s Builders and Schedulers. By comparing the before and after graphs I was able to iterate on some parts of my code without spending hours and hours testing.
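To give a rough idea of the approach, here is a minimal sketch (not the original code) that turns a mapping of Schedulers to Builders into a Graphviz digraph and lists the edges a change would add. The scheduler and builder names, and the `edges` and `to_dot` helpers, are hypothetical and only there for illustration.

```python
def edges(schedulers):
    """Return the set of (source, target) edges in a scheduler mapping.

    `schedulers` maps a scheduler name to a pair:
    (builders that trigger it, builders it triggers).
    """
    result = set()
    for sched, (upstreams, builders) in schedulers.items():
        for upstream in upstreams:
            result.add((upstream, sched))
        for builder in builders:
            result.add((sched, builder))
    return result


def to_dot(schedulers, name="release"):
    """Render the mapping as Graphviz DOT text, schedulers drawn as boxes."""
    lines = ["digraph %s {" % name]
    for sched in sorted(schedulers):
        lines.append('    "%s" [shape=box];' % sched)
    for src, dst in sorted(edges(schedulers)):
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical "before" and "after" configurations: one l10n repack
# builder is split into two parallel chunks.
before = {
    "release_start": ([], ["tag"]),
    "after_tag": (["tag"], ["build", "l10n_repack"]),
}
after = {
    "release_start": ([], ["tag"]),
    "after_tag": (["tag"], ["build", "l10n_repack_1", "l10n_repack_2"]),
}

print(to_dot(after))
print("new edges:", sorted(edges(after) - edges(before)))
```

Feeding the DOT text to Graphviz gives a picture, and the edge difference shows at a glance what a change would wire up differently.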

This week I finally got around to tidying up and packaging this code as a more general purpose tool. It’s not nearly complete and has many rough edges, but as a very basic tool to help you understand non-trivial Buildbot installations, I think it’s wonderful. It’s pip installable (“buildbot-scheduler-graph”) and available on GitHub. Once you’ve got it, try it out with “buildbot-scheduler-graph /path/to/your/master.cfg /path/to/output-dir”. Here’s what Mozilla’s scheduler graphs look like. What do yours look like?

May 21 2013
 

tl;dr: Use imp.load_source.

I’ve been hacking on a tool on and off that needs to load Python code from badly named files (eg, “master.cfg”). To my surprise, there wasn’t an obvious way to do this. My “go to” method of doing this is with execfile. For example, this will load the contents of master.cfg into “m”, with each top level object as a key:

m = {}
execfile("master.cfg", m)

This works well enough for simple cases, but what happens when you try to load a module that loads other modules? It turns out that execfile has a nasty limitation of requiring modules that aren’t in sys.path to be in the same directory as the file that calls execfile. You can’t even chdir your way around this; you have to copy the files you need to the caller’s directory. (We actually have some production code that does this.)

Someone in #python on Freenode suggested using importlib. That seemed like a fine idea, especially after recently watching Brett Cannon’s “How Import Works” talk. Unfortunately, Python 2.7’s importlib only has a single function, import_module, which can only load a module by name.

Eventually I came across a Stack Overflow post that pointed me at imp.load_source. This function is similar to execfile in that it loads Python code from a named file. However, it properly handles imports without the need to copy files around. It also has the nice added bonus of returning a module rather than throwing objects into a dict. I ended up with code like this, to load the contents of “foo/bar/master.cfg”:

>>> import imp, os, sys
>>> os.chdir("foo/bar")
>>> sys.path.insert(0, "") # Needed to ensure that the current directory is looked at when importing
>>> m = imp.load_source("buildbot.master.cfg", "master.cfg")

Problem solved!
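For anyone reading this on Python 3, where execfile no longer exists and the imp module was removed in 3.12, here is a rough equivalent built on importlib. It is a sketch rather than a drop-in replacement: the `load_source` helper, the module name, and the demo files it writes to a temporary directory are all made up for illustration.

```python
import importlib.machinery
import importlib.util
import os
import sys
import tempfile


def load_source(module_name, path):
    """Load Python source from `path`, whatever its file extension."""
    path = os.path.abspath(path)
    # Let the loaded file import its siblings without chdir or copying.
    sys.path.insert(0, os.path.dirname(path))
    # An explicit loader is needed because ".cfg" is not a known suffix.
    loader = importlib.machinery.SourceFileLoader(module_name, path)
    spec = importlib.util.spec_from_loader(module_name, loader)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    loader.exec_module(module)
    return module


# Hypothetical demo files: a master.cfg that imports a sibling module.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_helper.py"), "w") as f:
    f.write("BUILDERS = ['build', 'repack']\n")
with open(os.path.join(demo_dir, "master.cfg"), "w") as f:
    f.write("import demo_helper\nc = {'builders': demo_helper.BUILDERS}\n")

m = load_source("demo_master_cfg", os.path.join(demo_dir, "master.cfg"))
print(m.c["builders"])
```

As with imp.load_source, the result is a real module object, so top level names are attributes rather than dict keys.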

Feb 11 2013
 

Over the years we’ve had quite a few run-ins with Windows’ maximum path and command line lengths. These problems exhibit themselves with a cryptic “Bad file number” error and usually leave someone scratching their head for a bit. Due to the variable length of our build directories these problems are prone to biting us at inconvenient times – in particular, during release builds. They also have the potential to bite us on merge day or anytime cross-branch merging happens.

To prevent this from happening in the future I’m going to be landing a change to our build directory naming this week. After it lands we’ll be padding build directories with “0” until they reach the required length. For example, “try-w32” becomes “try-w32-0000000000000000000000” and “rel-m-beta-w32_bld” becomes “rel-m-beta-w32_bld-00000000000”. Because all build directories across all branches (including try) will be the same length, any change that would’ve eventually failed because of path or command line length will now fail during the original landing.
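The padding rule can be sketched in a few lines. This is an illustration, not the change that is landing: the `pad_builddir` name is made up, and the target length of 30 characters is an assumption inferred from the examples above.

```python
# Assumption: 30 characters, inferred from the padded examples in this post.
TARGET_LENGTH = 30


def pad_builddir(name, length=TARGET_LENGTH, pad_char="0"):
    """Pad a build directory name with "-" plus zeros up to `length`."""
    if len(name) >= length:
        # Already long enough; leave it alone.
        return name
    # One character of the budget goes to the "-" separator.
    return name + "-" + pad_char * (length - len(name) - 1)


for builddir in ("try-w32", "rel-m-beta-w32_bld"):
    padded = pad_builddir(builddir)
    print(padded, len(padded))
```

Every padded name comes out the same length, which is what makes a path length failure show up on the first landing instead of on a longer-named branch later.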

This change should be mostly invisible to developers, except for the extra characters in paths when looking at build logs.

Jan 30 2013
 

A few weeks ago we had a very strange problem with a release. We didn’t have the time to investigate it, so we added a hack to work around it and carried on. I’ve been slowly digging deeper into the issue this week and finally got to the bottom of it. As it turns out, the root problem was that our release builds weren’t using the same “make” as the non-release builds! This is absolutely unintended, and could’ve been the source of even more, subtler errors. Eventually, we would’ve hit a problem that we couldn’t have hacked around, and would’ve had to make scary changes under time pressure to fix it.

This is a great example of why it’s important to remove hacks from your code and take the time to understand the errors you hit. You can clean your house all you want but if you don’t find out where the mess is coming from you’ll never keep up with it.

Jan 09 2013
 

Hi folks,

With apologies for the short notice, this post is to let you all know that 32-bit Linux tests will be completely disabled on mozilla-central, mozilla-inbound, try, and all other mozilla-central-based branches for the next few days. Our B2G developers are working very hard to finish up 1.0, and need every cycle they can get on these machines to aid them in that.

A few notes, for clarity:
* 32-bit Linux builds are completely unaffected
* “make check” tests will still be run, because they run as part of the build job
* All of the disabled jobs will be re-enabled next week

Again, apologies for the short notice and we appreciate your understanding.

- Ben

Jan 07 2013
 

For as long as I can remember #build has been the Release Engineering (née Build & Release) home on IRC. We’ve learned over the years that most people who stop by #build randomly ask questions related to the build _system_, not the infrastructure. This ends up with us redirecting them to #pymake, #developers, or other channels where they’re more likely to get help. In order to help avoid this confusion we’ve decided to move RelEng to #releng and let #build be the home for the build system, likely to replace #pymake.

If you’ve got build infrastructure questions or just like to hang out with us, join us in #releng!

Dec 06 2012
 

As I mentioned last week, we’ve been working hard to get multilocale B2G builds going. This morning we flipped the switch and turned on multilocale Gaia across the board. The existing desktop and device builds on TBPL will now include a Gaia profile with Arabic, English, Spanish, French, Brazilian Portuguese, and Mandarin Chinese (zh-TW).

Additionally, we now have new desktop builds with all of the locales listed in this file available — meaning that most people localizing B2G can now test their translations in-app. These are available for Linux, Mac, and Windows in the “localizer” packages in this directory.

A few notes:

  • Linux users who have run a B2G desktop build in the past may need to rm -rf ~/.mozilla/b2g before these new builds will function correctly.
  • Gecko is still en-US only. Expect network errors and other such things to be in en-US for now. Multilocale gecko for B2G work is being tracked in bug 817197 and the bugs that it blocks.
  • All of the desktop builds will be updated on a nightly basis. However, there are no automatic updates for them – you must download by hand whenever you want to test newer code or translations.

Nov 28 2012
 

Currently, all of the B2G builds on TBPL are using a fixed set of locales which are built into the Gaia repository. These were OK at first, but we’re at the point now where we need to be able to test the languages that we’ll be initially shipping, as well as provide some way for localizers to test out their work. For these reasons, the following changes will be made to the B2G desktop and device builds:
* Unagi, otoro, panda, and the current desktop B2G builds will include 6 locales (instead of the 4 they currently do): Arabic (ar), English (en-US), French (fr), Spanish (es), Brazilian Portuguese (pt-BR), and Mandarin Chinese (zh-TW). These builds are intended for developer consumption and help to test out a wide array of features (RTL languages, languages with long strings, Unicode characters, etc).
* We will be adding new desktop b2g builds that contain all languages that Gaia is available in (see https://github.com/mozilla-b2g/gaia/blob/master/shared/resources/languages-all.json for the full list). These builds are intended for localizers to see how their translations look and feel.

These changes should be taking place sometime this week. Big thanks to Staś Małolepszy for adding support for this to the Gaia build system.

Note that for now, the Gecko portions (for example, network error pages in the Browser) will not be localized. We will be enabling localization for Gecko as soon as we can, but it’s not quite ready yet.

Nov 12 2012
 

It’s not uncommon to see unit test failures, performance issues, or build problems that seem to happen exclusively on our pool of build & test machines. If you’ve ever been plagued by one of these you know what a pain in the butt it is to debug such a thing. Because of this, RelEng is always willing to loan out slaves to anyone who needs to do on-machine debugging. All you need to do is file a bug! These requests can usually be turned around in the same business day. If you think having access will save you time, do not hesitate – file today!

Sep 19 2012
 

We’re in the process of developing a new version of our update server. It’s at the point now where we have a development environment set up for it, and slaves attempt to submit data to it. Recently we came across an issue where slaves were unable to successfully validate the SSL certificate of the development environment. Specifically, the client was raising this error from OpenSSL:

requests.exceptions.SSLError: [Errno 1] _ssl.c:480: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Our client uses the Python Requests library, and gives it a specific certificate to validate the server certificate with. For comparison we tried using openssl’s s_client with the same cert, which was able to validate the certificate just fine given the same inputs. HRM.

I tried tons of things to get it to work – different versions of the Requests library, giving it the root + intermediate cert, different versions of Python – but nothing worked! I couldn’t even reproduce the behaviour with pure openssl no matter what inputs I gave it. I reached out to #security for support and Yvan couldn’t even reproduce the problem on his Mac!

Eventually I reached out to #python-requests, thinking it was a bug in the library. Someone there suggested strace’ing both my python script and openssl. I did that, and found something very interesting: despite being given an explicit certificate bundle, openssl fell back onto the system certificates — my python script didn’t. To verify this, I changed my script to look at the full system root rather than just a specific certificate or bundle. After doing that, everything worked fine.

My belief is that one of PyOpenSSL, urllib3, or Requests doesn’t know how to look at the system certificate store at all on Windows or Linux. On Mac, it seems to fall back just fine on it. One thing is certain: security is hard, especially debugging it.
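The difference between “only this bundle” and “the system store” is easy to see with Python’s standard library ssl module. The sketch below is not what Requests or PyOpenSSL do internally, and it uses Python 3’s ssl API rather than the Python of the time; the `pinned_context` and `system_context` names are made up for illustration.

```python
import ssl


def pinned_context(ca_bundle):
    """Trust only the CA certificates in the given PEM file.

    When cafile is passed, create_default_context loads that file and
    does not load the system certificate store.
    """
    return ssl.create_default_context(cafile=ca_bundle)


def system_context():
    """Trust whatever the operating system's certificate store trusts."""
    return ssl.create_default_context()


# A bare client context trusts nothing until certificates are loaded,
# which is why a missing intermediate or root makes verification fail.
empty = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
print(empty.cert_store_stats())

ctx = system_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A client built like `pinned_context` has no fallback, so the file it is given has to contain everything needed to build the chain to a trusted root.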

Big thanks to Jake Maul, Rail, Yvan Boily, Dveditz, and #python-requests for helping me debug this.