Recently we ordered a set of server-class Linux machines to supplement our pool of VMs. They are lightning fast, especially compared to the VMs, but it's been a bit of a bumpy ride getting them ready for production. Most notably, we've had a mysterious problem where they would occasionally refuse to boot, halting at a "GRUB _" prompt. It took a while, but we believe we have this fixed now.
This problem first occurred on 2 out of 25 slaves. Catlee quickly discovered that it could be fixed with a simple re-installation of GRUB, so that's what we did, and we moved on. The thought at the time was that the MBR had somehow been partially overwritten or otherwise corrupted. A day later, 2 more slaves hit the same issue. Since it was the second time we'd hit it, there was some more speculation and digging. We had made a few changes to the machines, including:
Changing the hard disk controller from "IDE" mode to "AHCI".
Changing the kernel to a PAE version.
Both of those were pretty quickly dismissed as causes. It seemed very unlikely that the kernel version could cause an issue with the bootloader, and the problem didn't start immediately after changing the disk controller mode, so that seemed unlikely too. With other important things happening, we again moved on.
The next day I did some more googling, this time about GRUB in general, and came across a page detailing the GRUB boot process. It describes how to dump the contents of the MBR and view it as hex. Seeing that made me very eager to compare a working slave against a busted one. Unfortunately, there was no longer a busted machine to look at.
After 5 or so days without issues, and after all the other setup and configuration issues were taken care of, we decided to move them to production and deal with the GRUB problems if they arose. As luck would have it, 2 machines refused to boot as they were being moved. After booting from a rescue disk and dumping the MBR, I found that bytes 0x40 through 0x49 differed from those of a working slave. I also noticed that the MBR of a busted slave was identical to that of one which had never broken, and thus never had GRUB re-installed. This seemed to rule out MBR corruption.
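For anyone wanting to do the same comparison, here's a rough sketch of how it goes. On real hardware the dumps come from dd'ing the first sector of the boot disk (/dev/sda below is an assumption; substitute the real device). To keep this self-contained, the snippet fabricates two 512-byte images that differ only at byte 0x40, mirroring what we saw on the slaves:

```shell
# On a real machine, each dump would come from something like:
#   dd if=/dev/sda of=mbr.bin bs=512 count=1   # /dev/sda is an assumption
# Here we fabricate two images that differ only at byte 0x40 (decimal 64).

head -c 512 /dev/zero > mbr-working.bin                        # all zeroes
printf '\377' | dd of=mbr-working.bin bs=1 seek=64 conv=notrunc 2>/dev/null

cp mbr-working.bin mbr-broken.bin
printf '\200' | dd of=mbr-broken.bin bs=1 seek=64 conv=notrunc 2>/dev/null

# hexdump -C shows the whole sector; cmp -l lists every differing byte as
# "offset(decimal, 1-based) old-value-octal new-value-octal".
hexdump -C mbr-broken.bin
cmp -l mbr-working.bin mbr-broken.bin || true   # cmp exits 1 when files differ
```

Here cmp reports byte 65 (1-based, i.e. offset 0x40) as 377 vs. 200 octal — 0xFF vs. 0x80.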
With some more information in hand, I looked for help or pointers from the GRUB developers on Freenode. One of them pointed me to this section of the GRUB Manual, which documents some key bytes of the MBR. Notably, byte 0x40 is described as "The boot drive. If it is 0xFF, use a drive passed by BIOS." On a working slave this was set to 0xFF. On a broken one, it was set to 0x80 (which I was told means "first hard drive"). That certainly sounds like something that could affect bootability!
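Once you have a dump, checking that one byte is easy with od. Again, this sketch fabricates its own mbr.bin so it stands alone; on a real machine the file would be a dd of the boot disk's first sector:

```shell
# Fabricate a dump with the boot-drive byte set to 0xFF, as on a working
# slave. On real hardware mbr.bin would come from:
#   dd if=/dev/sda of=mbr.bin bs=512 count=1   # /dev/sda is an assumption
head -c 512 /dev/zero > mbr.bin
printf '\377' | dd of=mbr.bin bs=1 seek=64 conv=notrunc 2>/dev/null

# Skip 64 bytes (-j 64, i.e. offset 0x40), read one byte (-N 1),
# print it as hex (-tx1).
od -An -tx1 -j 64 -N 1 mbr.bin   # "ff" here; a broken slave showed "80"
```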
After thinking it over a few times, I came to the conclusion that 0x80 must somehow end up pointing at the wrong device to boot from. I also realized that no slave which had had GRUB re-installed had failed again. With all of that, I became confident that re-installing GRUB would fix the problem permanently. I ran all of this by Catlee, who told me that the GRUB developers had told him the BIOS could be re-ordering drives semi-randomly. That piece of information seems to fill in the last bit of the puzzle, and I'm more confident than ever that re-installing GRUB will permanently fix the problem.
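For reference, the re-install itself is quick from a rescue boot. With GRUB legacy it looks roughly like the following from the grub shell; the (hd0,0) location of /boot/grub is an assumption about these machines' layout, and `grub-install /dev/sda` (device also an assumption) wraps the same steps in one command:

```
# From the grub shell (run `grub` as root, or feed these to `grub --batch`)
grub> root (hd0,0)     # the partition holding /boot/grub -- an assumption
grub> setup (hd0)      # write stage1 into the MBR of the first BIOS disk
grub> quit
```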
It's still a mystery to me why the BIOS would re-order the drives at random. There's a "BIOSBugs" page on the GRUB wiki which describes a problem where the BIOS passes the wrong boot device. Since relying on the BIOS to pass the boot device has fixed our problem, I don't think it's the same thing. I haven't been able to find any information on this specific issue, or on how to find out which boot device the BIOS is passing to the bootloader, which makes it difficult to truly confirm our fix. If anyone has hit this, or knows how to get at this kind of information, I'd love to hear from you.