Jan 28 2010

Implementing Raid Monitoring on a 3Ware 3w-9xxx based controller.

When you pull a disk out of your RAID setup, a warning shows up in syslog:

  Jan 27 10:18:22 EL860 kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0019): Drive removed:port=1.
  Jan 27 10:18:22 EL860 kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit:unit=0, port=1.

However, if no one is looking at syslog, that won't really be helpful.

3Ware provides a tool on their site called tw_cli, which can be used to manage
the RAID setup from the command line.

  [EL860-root@EL860 admin]# tw_cli /c0 show

  Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
  ------------------------------------------------------------------------------
  u0 RAID-1 REBUILDING 41% - - 232.82 RiW ON

  VPort Status Unit Size Type Phy Encl-Slot Model
  ------------------------------------------------------------------------------
  p0 OK u0 232.88 GB SATA 0 - ST3250310NS
  p1 DEGRADED u0 232.88 GB SATA 1 - ST3250310NS

I figured I'd either have to write a wrapper script around that or find some other way of integrating it.
Asking the question on ##infra-talk gave me the following link to a check script on github

koollman: sdog: something like should work.

With that in your snmpd.conf you can get the info via SNMP:

  [root snmp]# snmpwalk localhost -v 2c -c public .
  021 | grep ext
  UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
  UCD-SNMP-MIB::extNames.1 = STRING: TW_RAID
  UCD-SNMP-MIB::extCommand.1 = STRING: /usr/local/sbin/check_tw
  UCD-SNMP-MIB::extResult.1 = INTEGER: 2
  UCD-SNMP-MIB::extOutput.1 = STRING: CRITICAL: Unit: u0, Type: RAID-1, Status: RE
  UCD-SNMP-MIB::extErrFix.1 = INTEGER: 0
  UCD-SNMP-MIB::extErrFixCmd.1 = STRING:
  UCD-SNMP-MIB::ssSysContext.0 = INTEGER: 2073
  UCD-SNMP-MIB::ssRawContexts.0 = Counter32: 11781783
  UCD-DLMOD-MIB::dlmodNextIndex.0 = INTEGER: 1
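
For the curious, a check of this kind does not need much code. What follows is a rough sketch in Python, not the script from the link: it parses the `tw_cli /c0 show` output shown earlier and returns a Nagios-style exit code plus a one-line message. The names `evaluate` and `read_controller` are made up for illustration, and treating every status other than OK as critical is my own assumption. The message format only imitates the truncated extOutput line above.

```python
import re
import subprocess

# Nagios-style exit codes; extResult = 2 in the snmpwalk output matches CRITICAL.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate(report):
    """Return (exit_code, message) for the text printed by `tw_cli /c0 show`."""
    problems = []
    units_seen = 0
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines and the dashed separators
        if re.fullmatch(r"u\d+", fields[0]):
            # Unit rows: name, type, status, ...
            units_seen += 1
            if fields[2] != "OK":  # assumption: only OK counts as healthy
                problems.append(
                    "Unit: %s, Type: %s, Status: %s" % (fields[0], fields[1], fields[2])
                )
        elif re.fullmatch(r"p\d+", fields[0]) and fields[1] != "OK":
            # Port rows: name, status, unit, ...
            problems.append("Port: %s, Status: %s" % (fields[0], fields[1]))
    if units_seen == 0:
        return UNKNOWN, "UNKNOWN: no units found in tw_cli output"
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, "OK: %d unit(s) healthy" % units_seen


def read_controller(controller="/c0"):
    """Run tw_cli for one controller and return its output (needs tw_cli installed)."""
    result = subprocess.run(
        ["tw_cli", controller, "show"], capture_output=True, text=True, check=True
    )
    return result.stdout


# The tw_cli output from this post, used here instead of a live controller.
SAMPLE = """\
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 REBUILDING 41% - - 232.82 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 232.88 GB SATA 0 - ST3250310NS
p1 DEGRADED u0 232.88 GB SATA 1 - ST3250310NS
"""

code, message = evaluate(SAMPLE)
print(code, message)
```

To use it for real you would feed `read_controller()` into `evaluate()`, print the message and exit with the returned code. Judging by the extNames and extCommand values above, the snmpd.conf hook was presumably a net-snmp `exec` line pointing TW_RAID at /usr/local/sbin/check_tw, but the original line is not shown here, so take that as a guess.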

May 15 2009

The machine that vanished.

Today I lost a machine, a physical one. I couldn't find it in my rack anymore. One moment I was logged on to it, and after I instructed it to boot off the network again for a fresh installation, it was gone.

When you have different ad hoc build development environments, you often grab whatever hardware is available to add to your pool and hope it doesn't kick you back. Time always works against you when you have to build a fresh platform from a pool of hardware ready to be reused.

I had half a rack of hardware ready to be redeployed. The default boot order of most machines is disk, then network, so we trigger a fresh network install by overwriting the MBR. So for this one machine, after a quick check that there was nothing relevant left on it, we sent it to the reboot pool.

The host was supposed to boot off the network, but I didn't even see a DHCP request coming in. So off to the lab it was .. where was that machine .. none of the consoles I tried was the correct one .. until I found one box with a really, really old installation, a machine that had returned from a different office.

And then it all became clear .. unlike all the other machines, this machine had a 2 disk RAID setup, which we actually weren't using. We had indeed cleared the boot sector of the first disk, but we had never cleared the second disk. So rather than booting off the network because the first disk failed, it booted off the old copy on the second disk.

Scratching that 2nd disk solved the problem .. for once it wasn't a DNS problem, but the RAID setup wasn't really helpful either :)

PS. Yes, re-labeling the machines is still on the todo list .. maybe next year :)

Aug 25 2008

Raid is obsolete

In a lot of environments.

Peter gives a nice overview of why you don't always need to invest in big fat redundant hardware.

We already tackled the topic last year ..

Now I often get weird looks when I dare to mention that Raid is obsolete .. people fail to hear the "in a lot of environments".

Obviously the catch is in the second part: you won't be doing this for your small shop around the corner with just one machine. You'll only be doing this in an environment where you can work with a redundant array of inexpensive disks, not with a server that has to sit in a remote and isolated location.

Besides that, there are situations where you will be using RAID not for redundancy but for disk throughput.