Nagios retention problem.

Andreas Ericsson ae at op5.se
Fri Nov 7 15:07:45 CET 2008


Markus.Almroth at teliasonera.com wrote:
> I run a nagios installation with 522 servers and 4654 service checks.
> 
> When adding or removing clients, it happens that about half or perhaps
> 2/3 of the service checks lose all status retention. What is more
> concerning is that they also go back to "initial state", e.g. notifications
> are turned off!! This is bad.
> 

I'll clarify a bit here for history reasons, so that people reading the
ML archives know what's going on. I've gotten the details from our
support staff.

"Adding a client" in this case means the equivalent of running

  /etc/init.d/nagios reload

or, in plaintext, sending SIGHUP to Nagios.
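
As a rough illustration of what that reload amounts to, here is a minimal
shell sketch that sends SIGHUP to the PID recorded in Nagios' lock file.
The helper name reload_nagios and the example path are assumptions made
for illustration; the real lock file location is whatever lock_file in
your nagios.cfg points to.

```shell
#!/bin/sh
# Sketch: send SIGHUP to the process whose PID is stored in a lock file.
# reload_nagios is a hypothetical helper name, not part of Nagios.
reload_nagios() {
    lockfile=$1
    pid=$(cat "$lockfile") || return 1
    # Refuse anything that is not a plain number before handing it to kill.
    case $pid in
        ''|*[!0-9]*) echo "no usable PID in $lockfile" >&2; return 1 ;;
    esac
    kill -HUP "$pid"
}

# Example call (the path is an assumption; see lock_file in nagios.cfg):
# reload_nagios /usr/local/nagios/var/nagios.lock
```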

> It doesn't happen every time, and it isn't the same servers every time.
> 

Does it happen with services or with hosts? If it's random, does it happen
more often with hosts/services that sort last alphabetically?

Apart from that, I'll need some more info to properly determine what's
going wrong here. What OS type/version are you using? 64 or 32-bit?
Multi-processor or single? What version of glibc are you using (actually,
what version of libpthread, but one can be inferred from the other)?
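
A minimal sketch for collecting those details in one go, assuming a Linux
box with glibc. GNU_LIBC_VERSION and GNU_LIBPTHREAD_VERSION are
glibc-specific getconf variables, so on other systems those lines simply
print "unknown". The function name report_platform is made up for this
example.

```shell
#!/bin/sh
# Sketch: print the platform details asked for above.
# report_platform is a hypothetical helper name.
report_platform() {
    echo "kernel:     $(uname -srm)"
    # Size of a C long in bits: 32 or 64 for the userland in use.
    echo "word size:  $(getconf LONG_BIT 2>/dev/null || echo unknown)"
    # Number of processors currently online.
    echo "cpus:       $(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
    # glibc-only variables; fall back to "unknown" elsewhere.
    echo "libc:       $(getconf GNU_LIBC_VERSION 2>/dev/null || echo unknown)"
    echo "libpthread: $(getconf GNU_LIBPTHREAD_VERSION 2>/dev/null || echo unknown)"
}

report_platform
```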

If you're running this on VMware with a guest OS emulating multiple CPUs,
I'm *guessing* you're running into an issue of Nagios not properly
checking for received signals before starting to write the retention
file, so the thread responsible for writing it gets killed by a signal
delivered to the controller thread. Running Nagios in VMware (a big
no-no, as most know) makes this more likely to happen.

You could try sending the RESTART_PROCESS command to Nagios' command
file instead, but you probably want to stagger it a bit so you don't
spam the poor FIFO in case you get lots of reload requests in a short
timeframe. For example, touch a file whenever a reload is wanted, and
have a cron job restart Nagios once every five minutes if the file
exists (make sure to remove the file after restarting, or you'll be
wasting cycles at a tremendous rate).
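
The staggering idea can be sketched roughly like this. It writes the
external command in the usual "[timestamp] COMMAND" form to the command
file, and only when the flag file exists. The function name, the flag
file path, the command file path and the cron line are all assumptions
for illustration; use whatever command_file in your nagios.cfg says.
Note that `date +%s` is a common extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch: restart Nagios via its command file only if a reload was requested.
# restart_if_requested is a hypothetical helper name, not part of Nagios.
restart_if_requested() {
    flag=$1       # file touched by whatever adds or removes clients
    cmdfile=$2    # Nagios external command file (a FIFO in a real setup)

    # Nothing requested since the last run: do nothing.
    [ -e "$flag" ] || return 0

    # Submit the external command; keep the flag if the write fails,
    # so the next cron run tries again.
    printf '[%s] RESTART_PROCESS\n' "$(date +%s)" > "$cmdfile" || return 1

    # Remove the flag so we don't restart again on every run.
    rm -f "$flag"
}

# Example call (both paths are assumptions):
# restart_if_requested /var/tmp/nagios-reload-wanted /usr/local/nagios/var/rw/nagios.cmd
#
# Example crontab entry, every five minutes (script path is hypothetical):
# */5 * * * * /usr/local/bin/nagios-staggered-restart
```

Whatever used to call the init script would then just `touch` the flag
file instead, so any number of requests within five minutes collapse
into a single restart.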

Needless to say, we don't have this problem and I haven't heard from
anyone else that suffers from it either, which suggests to me that
you're doing something that isn't quite normal. Having fired up our
stress-test config (12000 hosts, 60000 services, running a plugin
that emulates extremely skittish behaviour and submitting random
commands every now and then) on one of our servers, I've failed to
reproduce this problem.

> Very strange. It seems to me like some kind of buffer overflow.

It's not a buffer overflow. A buffer overflow would have left your
system riddled with core dumps, and Nagios would not have continued
running after receiving the SIGHUP.

> It started when I upgraded from 2.9 to 3.0.4.
> 

Strange. Given that there are no changes in the core between 3.0.4 and
3.0.5, I don't think it's worth upgrading to see if that solves the
problem (although you probably want to use 3.0.5, or the even more
fixed 3.0.5p1 from http://www.op5.org/src/nagios-3.0.5p1.tar.gz anyway
for the security fixes they add).

If you figure out what it is, or if you can give me enough information
to reproduce it, I'll see what I can do to fix this. We're just about
to ship a release right now though, so I won't have time to do anything
about it until Monday at the earliest.

Good luck.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




