More than 3 months’ worth of machine time has been saved by *you*

Ever since bug 541364 landed 4 months ago it’s been possible to selectively disable platforms on try by overriding specific mozconfigs. Since then, roughly 2321 hours (that’s 96 days, or 13 weeks, or ~3 months) of machine time have been saved this way, and that figure covers compile time only; even more has been saved on the test side. I just want to say a huuuuuuge THANKS! on behalf of RelEng. Taking the time to disable unneeded things on a push makes a noticeable difference in how quickly we can turn around a full set of tests, especially during busier times.
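
For the curious, the conversion in that parenthetical is straightforward. Here it is as a quick Python check; the 30.44-day average month is my own rounding choice rather than anything from the build stats:

    hours_saved = 2321                 # compile-time machine hours saved (figure above)
    days = hours_saved / 24.0          # ~96.7 days
    weeks = days / 7.0                 # ~13.8 weeks
    months = days / 30.44              # ~3.2 months, using an average month length
    print("%d hours = %.1f days = %.1f weeks = ~%.1f months"
          % (hours_saved, days, weeks, months))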

More stats:

  • The most commonly disabled platforms were Mac 64-bit (115 times), Maemo (4: 115, 5 GTK: 106, 5 Qt: 105), and Android (120 times)
  • The least commonly disabled were Windows (58 times), Linux 32-bit (85 times), and Mac 32-bit (96 times)

My week of buildduty

I’ve been on buildduty this week, which means I’ve been subject to countless interrupts to look at infrastructure issues, start Talos runs, cancel try server builds, and so on. It’s not shocking that I got very little of what I’d planned done this week, but I’m still a bit surprised when I look at the full list of what I *have* done:
- Supervise 4.0b4 release
- Deal with test masters getting really backed up
- RFT (Request for Talos) / try build canceling (~15 times)
- Disable merging for tests on try builds
- Help IT debug issues with Try repos
- Help Axel fix issues with l10n nightly builds on trunk
- Disable all nightly builds after omnijar bustage
- Schedule/manage downtime
- Sign BYOB builds
- Stage MozillaBuild 1.5.1
- Add Camino 2.0.4 to bouncer

…and my Friday is only half over!

Which build infrastructure problems do you see the most?

I’m hoping to tackle bug 505512 (Make infrastructure related problems turn the tree a color other than red) in the next few weeks. Most of the groundwork for it has been laid, which means that the bulk of what I’ll be doing is parsing logs for infrastructure errors.
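
To make the scope of that work a little more concrete, here is a minimal sketch in Python of the kind of classification involved. The patterns below are illustrative placeholders only; the real list will be built from actual buildbot/try logs and from the suggestions collected on the Etherpad mentioned below.

    import re

    # Illustrative patterns only: the real list would come from real logs,
    # not from this sketch.
    INFRA_PATTERNS = [
        re.compile(r"Connection (refused|reset by peer|timed out)", re.I),
        re.compile(r"No space left on device"),
        re.compile(r"abort: HTTP Error 50[0-9]"),
    ]

    def classify_failure(log_text):
        """Return 'infrastructure' if any known infra pattern matches, else 'build'."""
        for pattern in INFRA_PATTERNS:
            if pattern.search(log_text):
                return "infrastructure"
        return "build"

    print(classify_failure("remote: abort: Connection reset by peer"))          # infrastructure
    print(classify_failure("error: 'nsIFoo' was not declared in this scope"))   # build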

So, what errors do you see most from our build infrastructure? Are there other things that you would classify as infrastructure issues? Please add any suggestions you have to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors

Update on recent Tinderbox issues

My last post talked about the issues we’ve been having with load on the Tinderbox server and some ways we could address them. I’m happy to report that two things were completed yesterday that should keep the load under control for the foreseeable future.

One of the things mentioned in my previous post, splitting incoming build processing from the rest of Tinderbox (bug 585691), was completed very late last night. Additionally, Nick Thomas discovered that we had lost the cronjob that takes care of cleaning old builds out of Tinderbox’s memory. That script was re-enabled, and a one-time cleanup removed 64GB of old build data. Both of these were finished around 4am PDT this morning, and load is looking much better.
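
For readers curious what that expiry job amounts to, here is a rough Python sketch of the idea. The real cleanup is an existing Tinderbox script; the data directory and the retention window below are assumptions for illustration, not its actual configuration.

    import os
    import time

    DATA_DIR = "/var/tinderbox/data"        # assumed location of stored build data
    MAX_AGE_DAYS = 7                        # assumed retention window

    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for root, dirs, files in os.walk(DATA_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)             # drop build data older than the cutoff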

Especially since we’re now running the cleanup scripts on a regular basis again, I believe this should get us back to a good state.

Everyone should feel free to send justdave their thanks for staying up reaaaaallllly late last night to get us back to a good state.

Recent Tinderbox issues

As many of you know, there have been numerous times lately when Tinderbox has become unresponsive, sometimes to the point of going down completely for a while. This post attempts to summarize the issues and what’s being done about them.

The biggest issue is load (surprise!). In the space of a few years we’ve gone from a few active trees with tens of columns between them to tens of active trees with hundreds of columns between them. Unsurprisingly, this has made the Tinderbox server a lot busier. The biggest sources of load are:

  • showlog.cgi – Shows a log file for a specific build
  • showbuilds.cgi – Shows the main page for a tree (like this)
  • processbuilds.pl – Processes incoming “build complete” mail

A bit of profiling has also been done in bug 585814 to try to find specific hotspots.
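
For a back-of-the-envelope view of where the request volume goes, something like the following could tally hits per CGI from the web server’s access log. The log path and URL layout are assumptions, and processbuilds.pl is driven by incoming mail rather than HTTP, so it wouldn’t show up in a tally like this.

    from collections import Counter
    import re

    LOG_PATH = "/var/log/httpd/access_log"   # assumed location
    SCRIPT_RE = re.compile(r'"GET /(?:tinderbox/)?(showlog\.cgi|showbuilds\.cgi)')

    hits = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            match = SCRIPT_RE.search(line)
            if match:
                hits[match.group(1)] += 1

    for script, count in hits.most_common():
        print("%s: %d requests" % (script, count))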

We’ve already done a few things to help with Tinderbox load.

Other ways we’re looking at improving the situation:

  • bug 585691 – Split up Tinderbox data processing from display. This wouldn’t reduce overall load, but it should segregate it enough to keep the Tinderbox display up.
  • bug 390341 – Pregenerate brief and full logs. This would eliminate the need for showlog.cgi to uncompress logs in most cases (a sketch of the idea follows this list).
  • bug 530318 – Put full logs on FTP server; stop serving them from Tinderbox.
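
As promised above, here is a sketch of the idea behind bug 390341: generate the brief log once, when the build report arrives, instead of having showlog.cgi uncompress and filter the full log on every request. The file names and the "interesting line" filter are assumptions for illustration, not Tinderbox’s actual implementation.

    import gzip
    import re

    # Assumed definition of what belongs in a brief log.
    INTERESTING = re.compile(r"error|warning|TEST-UNEXPECTED", re.I)

    def pregenerate_brief(full_log_gz, brief_log_gz):
        """Write a filtered, compressed brief log alongside the full one."""
        with gzip.open(full_log_gz, "rt", errors="replace") as src:
            with gzip.open(brief_log_gz, "wt") as dst:
                for line in src:
                    if INTERESTING.search(line):
                        dst.write(line)

    # e.g. pregenerate_brief("build-full.log.gz", "build-brief.log.gz")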