The 4158 second catalog run.

Two of my tweets, sorry, dents, earlier today caused some people to ask me what on earth I was doing :)

You don't exist, go away!

Was the first one. Indeed, it had been a long time since I had actually seen that one. This happens when you delete the user you are logged in with on a host: when the host notices you don't exist anymore, it will tell you.

Now that is exactly what happened. We were busy reordering the uids on some hosts, so I modified the puppet config for that host and changed the uid values; a couple of minutes later I was told that I don't exist ..
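In puppet, a uid reordering like that is just a matter of editing the user resource; a minimal sketch (the username and uid values here are made up for illustration):

  # Hypothetical example: changing the uid attribute of a managed user.
  # The next catalog run will modify the account on the host, which is
  # exactly what logged me out from under myself.
  user { "sipx":
    ensure => present,
    uid    => 2001,  # changed from e.g. 501 during the reordering
    gid    => "sipx",
    home   => "/home/sipx",
  }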

The last time I saw that was about 10 years ago, when I was trying to fool some colleagues :)

Now the second tweet that attracted some people's attention was the one about a very lengthy catalog run:

  Apr 20 05:12:42 sipx-a puppet-agent[22384]: Finished catalog run in 4158.09 seconds

Indeed, a puppet catalog run of about 69 minutes; yes, that's 1 hour and 9 minutes ..

The reason for this lengthy catalog run was the above uid reordering, combined with

  file { "/var/sipxdata/":
    owner   => "sipxchange",
    group   => "sipxchange",
    recurse => true,
    ensure  => directory,
  }

And about 5K files in that directory .. apparently recurse doesn't translate to chown -R yet :)


#1 duritong : recursive management of large directories is tricky

There are some issues with managing a large folder hierarchy in puppet.

First, update to 2.6.x, as some improvements (memory and speed) for managing large folder hierarchies were introduced in 2.6.x. But actually, I would bet that you are already on 2.6.x?! ;)

(Regarding the next part I'm not sure if things changed recently or are still the same as I describe them. At least the situation was like that when I dealt with these problems...)

Second, puppet has the nice feature that it also knows (and records!) the checksum of each file managed in this directory. This means that puppet will compute the md5sum of every file (uff...) and also record it in its state.yml. You can turn that off by setting the checksum parameter to 'none'. That alone should already speed things up a lot.
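Applied to the directory from the post, that would look something like this (a sketch, reusing the original resource with the checksum parameter added):

  # Skip checksumming of the ~5K files: only ownership, group and type
  # are checked, so puppet no longer has to read every file's contents.
  file { "/var/sipxdata/":
    ensure   => directory,
    owner    => "sipxchange",
    group    => "sipxchange",
    recurse  => true,
    checksum => none,
  }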

And third, puppet will create a resource object for each file, so it can actually manage it. And this is the trickiest part: how could we avoid that? As you mentioned, puppet does not (yet) translate that to a chown -R, and I doubt it ever will, for the simple reason that the whole idea of puppet is to have a resource object for each file in this directory.

So how can we make that quicker, then? By translating the statement ourselves to chown -R.

The simple exec statement:

  exec { 'chown -R sipxchange:sipxchange /var/sipxdata': }

is quite obvious. We could combine it with a find statement that looks for files/directories not yet owned by sipxchange:sipxchange, and only run the chown if we find anything. But if you are mainly interested in speed, then the above is much faster. Additionally, you could set the loglevel to info so it does not poison your logs (and dashboard?) with a change on every puppet run.
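The find-guarded variant could be sketched like this (an assumption on my part that a find supporting -user/-group is available on the host; the grep -q makes onlyif succeed only when something actually needs fixing):

  # chown only runs when find reports at least one file or directory
  # that is not owned by sipxchange:sipxchange.
  exec { 'chown -R sipxchange:sipxchange /var/sipxdata':
    onlyif   => 'find /var/sipxdata ! -user sipxchange -o ! -group sipxchange | grep -q .',
    path     => ['/bin', '/usr/bin'],
    loglevel => info,
  }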

However, the disadvantage of this speedup is that puppet is no longer (internally) dealing with a resource for each file, and you are actually circumventing the puppet paradigm that a resource is only managed once. You also lose the idempotent puppet runs. So you will need to remember that everything in this directory is managed specially, and if you introduce any other resources in this directory that are managed in another place, you will need to keep these two statements in sync. Otherwise puppet will flip the ownership of a file back and forth between runs, etc. You also lose some other capabilities such as notifications, or at least they become less accurate.

So the main question you need to answer in such a situation is: which trade-off am I willing to pay, speed or consistency? In my opinion, both are valid answers in certain situations. So (sadly) it is sometimes actually worth crossing over puppet's clean resource model in favor of (much more) speed. :(

Oh, another idea that just popped into my mind: in combination with setting the checksum to none, you could also give that resource a special tag. And then, once that got some love, you could run puppet normally by excluding that large folder hierarchy, and only run it with everything once a day, or once a week. Another special case, but it wouldn't circumvent the internal puppet model, so it is maybe worth thinking about (and voting for the feature in the bugtracker... ;) ).
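A minimal sketch of the tagging idea (the tag name is made up; note that running only a given tag via --tags exists, while excluding a tag on normal runs is exactly the feature worth voting for):

  # Tag the expensive directory resource so it can be targeted separately.
  file { "/var/sipxdata/":
    ensure   => directory,
    owner    => "sipxchange",
    group    => "sipxchange",
    recurse  => true,
    checksum => none,
    tag      => "bigdir",
  }

A weekly cron job could then run `puppet agent --onetime --no-daemonize --tags bigdir` to reconcile just that hierarchy, since --tags limits the run to resources carrying that tag.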