RAID sucks

May 15 2009

The machine that vanished.

Today I lost a machine. A physical one: I couldn't find it in my rack anymore. One moment I was logged on to it, then I told it to boot off the network for a fresh installation, and it was gone.

When you run several ad hoc build and development environments, you often grab whatever hardware is available, add it to your pool, and hope it doesn't bite you back. Time always works against you when you have to build a fresh platform from a pool of hardware waiting to be reused.

I had half a rack of hardware ready to be redeployed. The default boot order of most machines is Disk, then Network, so we trigger a fresh network install by overwriting the MBR: with nothing bootable on the disk, the machine falls through to the network. So for this one machine, after a quick check to make sure nothing relevant was left on it, we sent it to the reboot pool.

The host was supposed to boot off the network, but I didn't even see a DHCP request coming in. So off to the lab it was. Where was that machine? None of the consoles I tried was the correct one, until I found one box with a really, really old installation: a machine that had returned from a different office.

And then it all became clear. Unlike all the other machines, this one had a two-disk RAID setup, which we weren't actually using. We had indeed cleared the boot sector of the first disk, but we had never cleared the second one. So rather than booting off the network when the first disk failed to boot, it booted off the old copy on the second disk.

Scratching that second disk solved the problem. For once it wasn't a DNS problem, but the RAID setup wasn't really helpful either :)
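For what it's worth, here is a rough Python sketch of the check I should have done: clear the boot sector on every disk in the box, not just the first one, and then verify that none of them still looks bootable. This is not the tooling we actually used. The function names are made up for illustration, and it assumes classic MBR disks, where a bootable first sector ends in the bytes 0x55 0xAA. The paths can be device nodes or plain disk image files.

```python
import os

SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a bootable MBR sector


def has_boot_signature(path):
    """Return True if the first sector at `path` ends with 0x55 0xAA."""
    with open(path, "rb") as disk:
        sector = disk.read(SECTOR_SIZE)
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def clear_boot_sector(path):
    """Zero the first 512 bytes at `path`.

    Destructive: this wipes the boot code AND the partition table.
    """
    with open(path, "r+b") as disk:  # r+b: overwrite in place, no truncation
        disk.write(b"\x00" * SECTOR_SIZE)
        disk.flush()
        os.fsync(disk.fileno())


def clear_all(paths):
    """Clear every listed disk; return the ones that still look bootable."""
    for path in paths:
        clear_boot_sector(path)
    return [p for p in paths if has_boot_signature(p)]
```

Calling `clear_all()` with every disk in the machine should return an empty list. Anything left in that list is a disk the BIOS may still happily boot from, which is exactly what bit me here.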

PS. Yes, re-labeling the machines is still on the to-do list .. maybe next year :)