The machine that vanished.

Today I lost a machine, a physical one: I couldn't find it in my rack anymore. One moment I was logged on to it, and after I told it to boot off the network for a fresh installation, it was gone.

When you run several ad hoc build and development environments, you often grab whatever hardware is available to add to your pool and hope it doesn't bite you back. Time always works against you when you have to build a fresh platform from a pool of hardware waiting to be reused.

I had half a rack of hardware ready to be redeployed. The default boot order of most machines is Disk, then Network, so we trigger a fresh network install by overwriting the MBR. So, after a quick check that there was nothing relevant left on this one machine, we sent it to the reboot pool.

The host was supposed to boot off the network, but I didn't even see a DHCP request coming in. So off to the lab it was .. where was that machine? None of the consoles I tried was the correct one .. until I found one box with a really, really old installation, a machine that had returned from a different office.

And then it all became clear: unlike all the other machines, this one had a two-disk RAID setup, which we weren't actually using. We had indeed cleared the boot sector of the first disk, but we had never cleared the second. So rather than falling through to the network when the first disk failed to boot, it booted off the old copy on the second disk.
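A check like the following might have caught this before the reboot. It is only a minimal sketch, assuming a classic BIOS/MBR setup where the firmware treats a disk as bootable when its first 512-byte sector ends in the bytes 0x55 0xAA. The function names and the image files in the demo are made up for illustration; on a real box you would pass the actual device paths and run it as root.

```python
import os
import tempfile

SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a bootable MBR sector


def has_boot_signature(sector: bytes) -> bool:
    """Return True if the 512-byte sector ends with the MBR boot signature."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def still_bootable(paths):
    """Return the paths whose first sector still looks bootable."""
    found = []
    for path in paths:
        with open(path, "rb") as disk:
            if has_boot_signature(disk.read(SECTOR_SIZE)):
                found.append(path)
    return found


if __name__ == "__main__":
    # Demo on two throwaway image files standing in for the two disks
    # (hypothetical stand-ins, not real devices).
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "disk0.img")
        second = os.path.join(tmp, "disk1.img")
        with open(first, "wb") as f:
            f.write(bytes(SECTOR_SIZE))  # wiped boot sector
        with open(second, "wb") as f:
            f.write(bytes(510) + BOOT_SIGNATURE)  # forgotten old install
        print(still_bootable([first, second]))
```

If the list that comes back is not empty, there is still a disk in the box that the BIOS may happily boot from instead of going to the network.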

Scratching that 2nd disk solved the problem .. for once it wasn't a DNS problem, but the RAID setup wasn't really helpful either :)

PS. Yes, re-labeling the machines is still on the todo list .. maybe next year :)

Comments

#1 Tarus : Lost a machine this week, too

We had a prolonged power outage in town, but the battery in the generator, which powers the switch that makes it switch over and actually power things, was bad, so the generator never came on.

We only have about 20 minutes of UPS time.

So, when I got back from lunch the first thing I did was go to OpenNMS and track down all of the machines with outages. I couldn't figure out one of them for the longest time, until I was reminded it was a VM (we don't have our Xen server set up to automatically "create" the VMs on a cold boot).

It's weird to be at a point where I don't remember the functions of all the machines on the network, or whether or not they are real.


#2 Frank : So your raid setup was wrong

So your raid setup was wrong and thus you blame raid? :)


#3 Kris Buytaert : Wasn't mine

It wasn't MY raid setup ..

The machine wasn't supposed to have raid :)