Jul 01 2009

DRBD2, OCFS2, Unexplained crashes

I was trying to set up a dual-primary DRBD environment with a shared disk running either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (but I also tried DRBD83 from testing).
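
For reference, dual-primary on DRBD 8 comes down to allowing two primaries in the net section of the resource and then promoting it on both nodes. A rough sketch; the resource name r0 and the fragment below are just an example, not my actual config:

  # /etc/drbd.conf (fragment) -- allow both nodes to be Primary
  resource r0 {
    net {
      allow-two-primaries;
    }
  }

  # then, on both nodes:
  drbdadm adjust r0     # reload the changed configuration
  drbdadm primary r0    # promote the resource on each node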

Setting up a single-primary disk and running bonnie++ on it worked. Setting up a dual-primary disk, only mounting it on one node (ext3) and running bonnie++ also worked.

When setting up OCFS2 on the /dev/drbd0 disk and mounting it on both nodes, basic functionality seemed to be in place, but usually less than 5-10 minutes after I started bonnie++ as a test on one of the nodes, both nodes power cycled, with no errors in the logfiles, just a crash.
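
For those wondering what the test boiled down to: roughly the commands below. The mountpoint and slot count are examples, and the O2CB cluster.conf setup is left out here:

  mkfs.ocfs2 -N 2 /dev/drbd0             # format with 2 node slots (run on one node)
  mount -t ocfs2 /dev/drbd0 /mnt/ocfs2   # mount on both nodes
  bonnie++ -d /mnt/ocfs2 -u root         # the benchmark that triggers the crash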

When at the console at the time of the crash, it looks like a disk IO block happens (you can still type, but no actions happen), then a reboot: no panics, no oopses, nothing (sysctl panic values set to timeouts etc).
Setting up a dual-primary disk with OCFS2, only mounting it on one node and starting bonnie++, causes only that node to crash.

At the DRBD level I got the following errors when that node disappeared:

  drbd0: PingAck did not arrive in time.
  drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
  drbd0: asender terminated
  drbd0: Terminating asender thread

That, however, is an expected error because of the reboot.

At first I assumed OCFS2 to be the root of this problem, so I moved forward and set up an iSCSI target on a third node, and used that device with the same OCFS2 setup. There no crashes occurred and bonnie++ flawlessly completed its test run.
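
I don't think the exact target software matters much here, but for the curious: with scsi-target-utils on CentOS, exporting a disk from the third box looks roughly like the lines below. The IQN, backing device and IP are made-up examples:

  # on the third node, with tgtd running
  tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2009-06.be.example:ocfs2test
  tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb
  tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL

  # on the OCFS2 nodes
  iscsiadm -m discovery -t sendtargets -p 172.16.32.3
  iscsiadm -m node -T iqn.2009-06.be.example:ocfs2test -p 172.16.32.3 --login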

So my attention went back to the combination of DRBD and OCFS2.
I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 8.3 variant from CentOS Testing.

At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.

Both the DRBD and the OCFS2 mailing lists were fairly supportive, pointing out that it was probably OCFS2 fencing both hosts after missing the heartbeat, and suggesting that I increase the heartbeat dead timeout values.

I however wanted to confirm that. As I got no entries in syslog, I attached a Cyclades, err, Avocent terminal server to the device in the hope that I'd capture the last kernel messages there ... no such luck either.

On the OCFS2 mailing list people suggested that I use netconsole to catch the logs on a remote node.
I set up netconsole using:

  modprobe netconsole netconsole="@/,@172.16.32.1/"
  sysctl -w kernel.printk="7 4 1 7"
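
For completeness: the netconsole parameter can also spell out the source port, interface, target port and target MAC address; the empty fields above simply fall back to the defaults (source port 6665, target port 6666). Something like this, with made-up addresses:

  # netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
  modprobe netconsole netconsole="6665@172.16.32.2/eth0,6666@172.16.32.1/00:16:3e:aa:bb:cc"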

After which I did indeed catch the errors on my remote host:

  [base-root@CCMT-A ~]# nc -l -u -p 6666
  (8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device drbd0 after 478000 milliseconds
  (8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active regions.
  ocfs2 is very sorry to be fencing this system by restarting

One would think that it outputs to the serial console before it logs over the network :) It doesn't.

Next step is that I'll start fiddling some more with the timeout values :) (note the ":)")
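
For the record, the knob in question is the O2CB heartbeat dead threshold, which on CentOS lives in /etc/sysconfig/o2cb; the value below is just an example, I haven't settled on one yet:

  # /etc/sysconfig/o2cb (fragment)
  # a node fences itself after (threshold - 1) * 2 seconds without a
  # successful heartbeat write; the default of 31 is roughly 60 seconds
  O2CB_HEARTBEAT_THRESHOLD=61

  # restart the cluster stack (with the OCFS2 volumes unmounted) to apply
  service o2cb restart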

Jun 29 2009

Diaper Needs Service Problem

Late last Saturday, Sandy gave birth to our 2nd daughter,
Amber. Pics etc are on her own site.

So we'll be changing the diapers of 2 little Buytaert kids for a while :)

PS. Craig from O'ReillyGMT gets the credit for inventing the new DNS acronym.

Jun 24 2009

Weird Scenes inside the Tarball, Midnight Commander woes

I was transferring an apt repository to a remote site using a tarball I created locally. When the first machines tried to do an apt-get update from that repository, they failed to get the package list.

  Jun 24 11:01:28 10.99.2.253 root: Failed to fetch http://10.99.0.1/repo/ntc/i386/base/pkglist.CentOsDistro 404 Not Found
  Jun 24 11:09:29 10.99.2.253 root: Err http://10.99.0.1 i386/CentOsDistro pkglist

However, at first sight the appropriate files were in place:

  base.old]# ls -al
  total 700
  drwxr-xr-x 2 root root 4096 Jun 23 10:09 .
  drwxr-xr-x 11 root root 4096 Jun 24 13:01 ..
  -rw-r--r-- 1 root root 372662 Jun 23 10:07 pkglist.CentOsBase
  -rw-r--r-- 1 root root 59804 Jun 23 10:07 pkglist.CentOsBase.bz2
  -rw-r--r-- 1 root root 54788 Jun 23 10:07 pkglist.CentOsBaseUpdates
  -rw-r--r-- 1 root root 9908 Jun 23 10:07 pkglist.CentOsBaseUpdates.bz2
  -rw-r--r-- 1 root root 35685 Jun 23 10:07 pkglist.CentOsCustom
  -rw-r--r-- 1 root root 8231 Jun 23 10:07 pkglist.CentOsCustom.bz
  -rw-r--r-- 1 root root 34663 Jun 23 10:07 pkglist.CentOsDistro
  -rw-r--r-- 1 root root 8901 Jun 23 10:07 pkglist.CentOsDistro.bz
  -rw-r--r-- 1 root root 23432 Jun 23 10:07 pkglist.CentOsExtrapackages
  -rw-r--r-- 1 root root 5931 Jun 23 10:07 pkglist.CentOsExtrapackages.bz2
  -rw-r--r-- 1 root root 11144 Jun 23 10:07 pkglist.Externals
  -rw-r--r-- 1 root root 3607 Jun 23 10:07 pkglist.Externals.bz2
  -rw-r--r-- 1 root root 1650 Jun 23 10:09 release
  -rw-r--r-- 1 root root 129 Jun 23 10:07 release.CentOsBase
  -rw-r--r-- 1 root root 136 Jun 23 10:07 release.CentOsBaseUpdates
  -rw-r--r-- 1 root root 131 Jun 23 10:07 release.CentOsCustom
  -rw-r--r-- 1 root root 131 Jun 23 10:07 release.CentOsDistro
  -rw-r--r-- 1 root root 138 Jun 23 10:07 release.CentOsExtrapackages
  -rw-r--r-- 1 root root 128 Jun 23 10:07 release.Externals

Now if you take a closer look you'll notice that 2 out of the 6 .bz2 files end with .bz rather than .bz2.
On the server where I created the tarball they had the correct .bz2 extension.

When digging into the remote tarball with mc (Midnight Commander), however, the .bz extension was shown again.

Somehow, while copying the directories out of the tarball using mc, the filenames got changed, as upon doing a regular untar from the command line the files once again had the correct .bz2 extension.

I'm not using mc again to get selected data out of a tarball; I'll stick to the old command line for untarring :)
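
Listing and extracting just the bits you need from the command line is hardly any extra work anyway; something along these lines, with the archive name and path as examples:

  tar -tzf repo.tar.gz | grep 'pkglist.*bz'   # verify the names inside the archive
  tar -xzf repo.tar.gz repo/ntc/i386/base     # extract only the directory you need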

Weird stuff ...

Jun 15 2009

Twitter Woes

This one has been sitting in the drafts for too long, but frankly there's not much I have to add.
Twitter is acting weird, we all know that .. but the following movie really beats everything.

Yes, that's right: what you are seeing is me logging into my Twitter account, checking my @krisbuytaert
messages, seeing the @vti messages, and then even taking full control of their (meanwhile deleted) account.

I used Istanbul to record the session and realized later that I should have done it at a better quality; however, I didn't want to do the "hack" again :)

Anyone got an explanation for this? Obviously I've sent the VTI people a mail so they are informed of this hiccup.

Jun 04 2009

resolv.conf problem

A colleague was troubleshooting a machine that he couldn't log into anymore:

root@Validation_Radio3_Ku_VN1:~# cat /etc/shadow
# /etc/resolv.conf generated on 20090604022133 by
/etc-ro/rc.d/resolvdotconf.pl
nameserver 192.168.104.1

Really folks .. Everything is a fine dns problem :)

PS. Yeah yeah .. the real problem was a corrupted filesystem.

Jun 02 2009

Told Ya sooo

By now everybody and their neighbour has realized that indeed Everything is a funky dns problem; Frank is giving talks about it at ZooCamp, and Serge figured out the hard way that the downtime of planet.geekdinner.be was due to a dns problem :)

But I told you different things before ... and some of you listened, while others are still reinventing the wheel as we go ...

Matt A. points out that the OpenBravo folks realized that one should try to build on top of Open Source projects rather than modify core code ..

Wonder where he read that before:
Some projects are prepared for local contributions, they have a modular framework that allows you to build on top of the project while not having to touch the core of a project, Drupal and openQRM are great examples of those, but not all projects are that smart. Needless to say that when you have such a modular framework you really shouldn't be modifying the core part of the platform, unless you are fixing a real bug.

Over at the MySQL Performance Blog, Peter points out that open source projects don't do big fat marketing campaigns, and the community doesn't appreciate features being developed in a corner and then released with a big bang ... we prefer our releases Early and Often...

On automating software installations, there is a Tom Limoncelli post about how we install software and debug setups, with a nice quote: "Oh my god. Is that why nobody uses the GUI we spend millions to develop?".
Well, Luke said it before .. If my computer can't install it .. the installer is broken.

Stephen Walli has some good thoughts on how Open Source vendors should set up their partner programs, indeed with their eyes wide open ..

I've ranted before about Open Source vendors thinking they should still work with partner models and the channel the traditional way.

However, I have to admit that over the last month I did talk to people that do understand our love/hate relationship with the Open Source vendors that want to partner with us .. and that some of the newer Open Source vendors are actually attracted by our different way of tackling partnerships.

Oh well.. as Tarus says .. my 3 readers understand..

May 26 2009

ZooCamp

Last Saturday was ZooCamp, also known as Barcamp Antwerp II.

The location, the Antwerp Zoo, was the same as for the last CloudCamp Antwerp edition, so I was already pretty familiar with the venue.

Lots of new people at ZooCamp and also a bunch of regulars..

My 4th barcamp and the 2nd where I didn't do a presentation; I hope the fact that Inuits was sponsoring ZooCamp made up for that.. contributions can be done in different forms, and you can't always keep coming up with new topics.

I went to ZooCamp with a backpack full of internet-connected things, as that was the topic: my freshly upgraded TuxDroid and my Chumby .. but I didn't even have the time to fully unpack them and set them up. The biggest problem with the Chumby was the website authentication that was required for wifi access.. and a Chumby with no wifi is like a pub with no beer.
I had even planned to do a rerun of my CloudSec talk from the last CloudCamp Antwerp; however, the schedule filled up quite fast with different interesting talks ...

Frank even managed to hijack my favourite topic, yes that's right .. he did a talk titled "Everything is a Frakkin DNS Problem". He pretty much covered zero of my typical cases so I can still do a run of my version of that talk one day :)

This barcamp was almost back to basics: apart from the location, wifi and the obligatory catering, the crowd took charge. Lunch was everybody on their own .. so we ran out for sushi, great ..
Not sure if we need to cough up that amount of money for a location in the future ..
Ideally we'd be able to find a location for free at some company's office, but that seems to be a difficult task for a Saturday event.

Oh and we got to see the baby Elephant.. it was sleeping :)

May 19 2009

Puppet Meetup Leuven

As Teyo will be in Leuven (Belgium) next week for the training, some folks had the idea to meet up after the training for a Puppet Users Meetup.

So the plan is to meet for drinks and Puppet talk, next week Wednesday from 20:00 in the Domus in Leuven.

http://www.domusleuven.be/

I hope to meet a bunch of you there !

PS. Feel free to spread the news.
PPS. There is a chance I won't make it myself, which shouldn't keep you folks back :)

May 19 2009

ZooCamp

Less than 1 week till ZooCamp, you know .. Barcamp Antwerp 2, but this time in the Antwerp Zoo.

Register here

May 15 2009

Fun with Google Docs Urls

I'm not a big user of docs.google.com, but occasionally I use it to share a public document to work on with friends or colleagues.

So we have this spreadsheet we're sharing with some family and friends to swap Disney stickers. Google Docs has the option to publish that document publicly as html for others to view.

So I tried it, and it generated a very nice URL:

http://spreadsheets.google.com/pub?key=rtlvf2-JSU1Pw-oPtuIZBPg&output=ht...

My sleepy eye caught the A1:C300 ending part .. which was generated by the friendly popup that asked me if I wanted to show all sheets, or just a range of the page.

Dare I paste that URL into another browser and change the range? Like changing the range from A1:C300 to A1:D300?

Surprise surprise .. that worked! I could perfectly see the content of the other cells.

Apart from pointing to the Google API, the popup doesn't really mention that publishing only a range won't restrict the actual viewing of the other data.

I can imagine some less technically savvy people expecting the rest of their data to be secure... Well, obviously it's not!
Not sure if Google does this on purpose, or by accident.

If it stops working next week it was by accident :)