My week of buildduty

I've been on buildduty this week, which means I've been subject to countless interrupts to look at infrastructure issues, start Talos runs, cancel try server builds, and so on. It's not shocking to me that I got very little of what I had planned done this week, but I'm still a bit surprised when I look at the full list of what I have done:

...and my Friday is only half over!

Which build infrastructure problems do you see the most?

I'm hoping to tackle bug 505512 (Make infrastructure related problems turn the tree a color other than red) in the next few weeks. Most of the groundwork for it has already been laid, which means that most of what I'll be doing is parsing logs for infrastructure errors.
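
To give a sense of what that log parsing looks like, here's a minimal sketch. The error patterns and the simple true/false classification are assumptions for illustration only; building up the real list of patterns from our logs is exactly the work described above.

    import re

    # Hypothetical patterns for failures caused by the infrastructure rather than
    # by the code being built. The real list will come from studying actual logs.
    INFRA_ERROR_PATTERNS = [
        re.compile(r"remoteFailed: \[Failure instance"),  # lost contact with the slave
        re.compile(r"No space left on device"),
        re.compile(r"Connection timed out"),
    ]

    def looks_like_infra_failure(log_path):
        """Return True if any known infrastructure error appears in the log."""
        with open(log_path, errors="replace") as log:
            for line in log:
                if any(pattern.search(line) for pattern in INFRA_ERROR_PATTERNS):
                    return True
        return False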

So, what errors do you see most from our build infrastructure? Are there other things that you would classify as infrastructure issues? Please add any suggestions you have to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors

Update on recent Tinderbox issues

My last post talked about the issues we've been having with load on the Tinderbox server and some ways we could fix it. I'm happy to report that two things were completed yesterday that should keep the load under control for the foreseeable future.

One of the things mentioned in my previous post, splitting incoming build processing from the rest of Tinderbox (bug 585691), was completed very late last night. Additionally, Nick Thomas discovered that we had lost the cronjob that takes care of cleaning out old builds from Tinderbox's memory. That script was re-enabled, and a one-time cleanup removed 64GB of old build data. Both of these were completed around 4am PDT this morning, and load is looking much better.

Especially because we're now running the cleanup scripts on a regular basis again, I believe this should keep Tinderbox in a good state.
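
For the curious, the idea behind that cleanup job is simple: walk the build data, delete anything older than the retention period, and run it regularly from cron. The sketch below is an illustration only; the data location and the two-week cutoff are assumptions, and the real cleanup is a script that ships with Tinderbox, not this Python.

    import os
    import time

    # Assumed values for illustration; the real script and its settings differ.
    TINDERBOX_DATA = "/var/tinderbox/data"
    MAX_AGE_DAYS = 14

    def clean_old_builds(root=TINDERBOX_DATA, max_age_days=MAX_AGE_DAYS):
        """Delete build data files older than the retention period; return bytes freed."""
        cutoff = time.time() - max_age_days * 24 * 60 * 60
        freed = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    freed += os.path.getsize(path)
                    os.remove(path)
        return freed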

Everyone should feel free to send justdave their thanks for staying up reaaaaallllly late last night to get us back to a good state.

Recent Tinderbox issues

As many of you know there have been numerous times lately that Tinderbox has become unresponsive, sometimes to the point of going down completely for a period of time. This post will attempt to summarize the issues and what's being done about them.

The biggest issue is load (surprise!). Over the course of a few years we've gone from a few active trees with tens of columns between them to tens of active trees with hundreds of columns between them. Unsurprisingly, this has made the Tinderbox server a lot busier. The biggest load items are:

  • showlog.cgi - Shows a log file for a specific build
  • showbuilds.cgi - Shows the main page for a tree (like this)
  • processbuilds.pl - Processes incoming "build complete" mail

A bit of profiling has also been done in bug 585814 to try to find specific hotspots.

We've already done a few things to help with Tinderbox load:

Other ways we're looking at improving the situation:

  • bug 585691 - Split up Tinderbox data processing from display. This wouldn't reduce overall load, but it should segregate it enough to keep the Tinderbox display up.
  • bug 390341 - Pregenerate brief and full logs. This would eliminate the need for showlog.cgi to uncompress logs in most cases.
  • bug 530318 - Put full logs on FTP server; stop serving them from Tinderbox.

You can have access to our machines

Do you have a random or permanent orange bug that you want to debug? Are you having trouble reproducing it? If so, please file a bug in mozilla.org:Release Engineering and we'll get you access to a machine. If the failure is on Linux you can even bypass this and download a copy of our ref VM here. Anyone willing to put in the time and effort is welcome to take us up on this offer -- you do not need to be an employee of Mozilla.

Serendipity

This week I've been continuing to work on issues related to the new Linux build machines mentioned in my last post. I'm hoping to resolve a few test failures on them before they get put back into production. Yesterday my morning was focused on this task, and by lunch I was tearing my hair out. Over lunch with some MoTo co-workers we ended up chatting about these failures, and the wonderful Ehsan volunteered to look at one of them. By the end of the day one third of the failures were fixed, and a previously unknown bug was resolved!

Big thanks to Ehsan for spending his valuable time on this!

GRUB, the MBR, BIOS bugs?

Recently we ordered a set of server class Linux machines to supplement our pool of VMs. They are lightning fast, especially compared to the VMs, but it's been a bit of a bumpy ride getting them ready to go to production. Most notably, we've had a mysterious problem where they would occasionally refuse to boot, halting at a "GRUB _" prompt. It took a while, but we believe we have this fixed now.

This problem first occurred on 2 out of 25 slaves. Catlee quickly discovered that it could be fixed with a simple re-installation of GRUB, so that's what we did, and moved on. The thought at the time was that the MBR somehow got partially overwritten or otherwise corrupted. A day later, 2 more slaves hit the same issue. Since it was the second time we hit the issue, there was some more speculation and digging. We had made a few changes to the machines, including:

  • Changing the hard disk controller from "IDE" mode to "AHCI".

  • Changing the kernel to a PAE version.

Both of those were pretty quickly dismissed as the causes. It seemed very unlikely that the kernel version could cause an issue with the bootloader, and the problem didn't occur instantly after changing the disk controller mode, so that seemed unlikely too. With other important things happening we again moved on.

The next day I did some more googling, this time about GRUB in general, and came across a page detailing the GRUB boot process. Among other things, it explains how to dump the contents of the MBR and view it as hex. Seeing that made me very eager to compare a working slave against a busted one. Unfortunately there was no longer a busted machine to look at.

After 5 or so days without issues, and after all the other setup and configuration issues were taken care of, we decided to move them to production and deal with the GRUB problems if they arose. As luck would have it, 2 machines refused to boot as they were being moved to production. After booting from a rescue disk and dumping the MBR, I found that bytes 0x40 through 0x49 differed from those of a working slave. I also noticed that the MBR of the busted slave was identical to that of one which had never broken, and thus had never had GRUB re-installed. This seemed to rule out MBR corruption.

With some more information in hand, I looked for help and pointers from the GRUB developers on Freenode. One of them pointed me to this section of the GRUB Manual, which documents some key bytes of the MBR. Notably, byte 0x40 is described as "The boot drive. If it is 0xFF, use a drive passed by BIOS.". On a working slave this was set to 0xFF. On a broken one, it was set to 0x80 (which I was told means "first hard drive"). That certainly sounds like something that could affect bootability!
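
To make that concrete, here's roughly what the inspection boils down to, as a sketch. On the actual slaves I was working from a rescue disk with standard command-line tools, and the device path below is an assumption:

    # Sketch of inspecting the GRUB boot-drive byte in the MBR.
    # The device path is an assumption; run as root, and only ever read from it.
    MBR_SIZE = 512
    BOOT_DRIVE_OFFSET = 0x40  # "The boot drive. If it is 0xFF, use a drive passed by BIOS."

    with open("/dev/sda", "rb") as disk:
        mbr = disk.read(MBR_SIZE)

    print("bytes 0x40-0x49:", mbr[0x40:0x4A].hex())
    boot_drive = mbr[BOOT_DRIVE_OFFSET]
    if boot_drive == 0xFF:
        print("use the drive passed by the BIOS (what our working slaves had)")
    elif boot_drive == 0x80:
        print("hardcoded to the first hard drive (what our broken slaves had)")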

After thinking it over a few times I came to the conclusion that somehow 0x80 must end up being the wrong device to boot from. I also realized that no slave which had had GRUB re-installed had failed again. With all of that, I became confident that re-installing GRUB would fix the problem permanently. I ran all of this by Catlee, who mentioned that the GRUB developers had told him the BIOS could be re-ordering drives semi-randomly. That piece of information seems to fill in the last bit of the puzzle, and I'm more confident than ever that re-installing GRUB will permanently fix the problem.

It's still a mystery to me why the BIOS would be re-ordering the drives at random. There's a "BIOSBugs" page on the GRUB wiki which describes a problem where the BIOS sends the wrong boot device. Since relying on the BIOS to send the boot device has fixed our problem, I don't think it's the same thing. I haven't been able to find any information on this specific issue, or on how to find out which boot device the BIOS is sending the bootloader, which makes it difficult to truly confirm our fix. If anyone has hit this, or knows how to get at this kind of information, I'd love to hear from you.

All about the RelEng sheriff

Since February of this year we've had a rotating RelEng "sheriff" available. We started it to make a couple of things better:

  • Improve response time on critical issues
  • Avoid having the whole team distracted with infrastructure issues

By and large, this has been an improvement for us and, we think, for developers as well. Serious issues are dealt with more quickly, and developers and the developer sheriff have someone specific to go to with acute issues that come up. Internally, this has helped us focus more, too: with the RelEng sheriff dealing with triage and other acute issues, the rest of us are able to get on with our other work without distraction.


What is the RelEng sheriff responsible for?


Who is the RelEng sheriff?

The RelEng sheriff is rotated weekly. You can find out who the current RelEng sheriff is by looking at the schedule.


How to get a hold of the RelEng sheriff

The best place to find them is on IRC, in #build or #developers. They should be wearing a '|buildduty' tag at the end of their nick. You can also get our attention in other ways, if IRC doesn't work for you:

Bugs and IRC pokes are the preferred methods, but any of them will work. Also note that the RelEng sheriff is only around during their normal working day, which can be PDT/PST, EDT/EST, or NZDT/NZST. If the RelEng sheriff isn't around, someone can be reached in #build.


What can your sheriff do for you?

The on-duty RelEng sheriff would be more than happy to do any of the following for you:

  • Trigger any sort of build or test run you need, including:
    • Extra unit test or Talos runs of any given build
    • Retriggering builds that fail for spurious reasons
  • Deal with any nightly updates that fail
  • Help debug possible build machine issues
  • Help debug test issues that you cannot reproduce yourself
  • Answer questions you may have about build or test infrastructure

The RelEng sheriff is also a good first-contact point for any other random things. They may be able to help you directly but if not, they can certainly point you to the person who can.

After reading this, I hope you have a better understanding of the who, what, and why of the RelEng sheriff. If anything is unclear or absent I'm happy to clarify.

Anatomy of an SDK update

Over the course of the past week or so I've been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed, they didn't to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.

Let's start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather as a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn't enable you to do a customized, silent install - the precise thing we need. I vainly thought I could figure out the proper order and magic options to install the enclosed MSIs myself. Needless to say, this failed. To work around it I fell back on using an AutoIt script that clicks through the interactive installer for me. It took some fussing, but not too much difficulty, to get that working.

Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This requirement seems unnecessary, as we've deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables, I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation; many thanks to them.) I'm not the first person to hit problems like this, and they pointed me to a template for a script that works around such an issue. The solution ends up being:

  1. Copy installation files to the slave
  2. Create a new user in the Administrators group, set that user to automatically login at next boot
  3. Reboot, and run the package installation at login
  4. Restore the original automatic login, reboot
  5. Cleanup (delete installation files, remove the created user)

This is obviously quite hacky, but it gets the job done.
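
For the curious, here's a rough sketch of the auto-login portion of that workaround. This is not the actual OPSI script (that one is written in OPSI's own scripting language); the user name, password, and the way the installer gets launched are all assumptions for illustration:

    import subprocess
    import winreg

    # Illustrative values only; the real deployment uses OPSI's tooling and
    # its own credentials, not these.
    INSTALL_USER = "sdkinstall"
    INSTALL_PASSWORD = "temporary-password"
    WINLOGON = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"

    def create_install_user():
        """Create a temporary account and put it in the Administrators group."""
        subprocess.check_call(["net", "user", INSTALL_USER, INSTALL_PASSWORD, "/add"])
        subprocess.check_call(["net", "localgroup", "Administrators", INSTALL_USER, "/add"])

    def enable_autologon():
        """Point Windows' automatic logon at the temporary account for the next boot."""
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WINLOGON, 0, winreg.KEY_SET_VALUE)
        winreg.SetValueEx(key, "AutoAdminLogon", 0, winreg.REG_SZ, "1")
        winreg.SetValueEx(key, "DefaultUserName", 0, winreg.REG_SZ, INSTALL_USER)
        winreg.SetValueEx(key, "DefaultPassword", 0, winreg.REG_SZ, INSTALL_PASSWORD)
        winreg.CloseKey(key)

    # The SDK installer itself runs from that user's first login (for example via
    # a RunOnce entry), after which the original auto-login settings are restored,
    # the machine reboots again, and the temporary account is deleted.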

So! With that in hand (and in the repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For reasons I haven't yet figured out, some of the slaves needed some kicking to do the installation properly.

Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment, and as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed for most of Thursday, until I was doing a final test of my build configuration patch.

Here's where the fun starts for this part. After discovering I'd accidentally changed the SDK for everything, I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the "unit test" ones. So we weren't even using the same SDK for all the builds on a given branch! Getting all of that sorted out was complicated by all of the path-style variations (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things (illustrated in the sketch after the list):

  • If you're specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
  • If you're specifying PATH in a mozconfig, you CANNOT use Windows-style paths - you must use MSYS style
  • You can't test for these things properly without clobbering
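
To make the path-style distinction concrete, here's a small sketch of the difference. The SDK location is a made-up example, not our actual configuration:

    # Hypothetical SDK location, shown in the two styles that matter here:
    # LIB/INCLUDE/SDKDIR want the Windows style, PATH wants the MSYS style.
    SDK_DIR_WINDOWS = "c:/Program Files/Microsoft SDKs/Windows/v7.0"

    def to_msys(windows_path):
        """Convert a Windows-style path ('c:/foo' or 'c:\\foo') to MSYS style ('/c/foo')."""
        drive, rest = windows_path.split(":", 1)
        return "/" + drive.lower() + rest.replace("\\", "/")

    print(SDK_DIR_WINDOWS)           # c:/Program Files/Microsoft SDKs/Windows/v7.0
    print(to_msys(SDK_DIR_WINDOWS))  # /c/Program Files/Microsoft SDKs/Windows/v7.0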

As I write this, the first set of builds that all use the correct SDK is finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:

  • Windows Vista SDK (6.0a): mozilla-1.9.1 builds
  • Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds

WinCE and WinMO builds are unaffected by this deployment.

mozilla-central, mozilla-1.9.2 nightly builds (ATTN: nightly users)

Because of the major version bump in mozilla-central, all users of mozilla-central nightlies will be bumped to mozilla-1.9.2 nightlies today. If you want to continue to track the Firefox 3.6 / Gecko 1.9.2 builds no action is required. If you want to track the post-1.9.2 version or absolute "trunk" of Firefox/Gecko you will need to download today's mozilla-central nightly build, found in the nightly area of the ftp server.