Nagios retention problem.

Markus.Almroth at teliasonera.com
Fri Nov 7 15:52:32 CET 2008
It seems to happen only with services. That might be because hosts come
first in the retention.dat file.

Linux antnagios.sun.telia.se 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17
18:00:32 EDT 2006 i686 athlon i386 GNU/Linux
32-bit. 

glibc-2.3.4-2.25


Running on a VMware server, but only one CPU is emulated.

I've tried restarting with kill -HUP and by writing to the external
command file, but it still happens occasionally. I've now changed to
using restart; we'll see what happens.

My impression is that the problem occurs on startup rather than
shutdown. I've set retention_update_interval=0. Next time it happens
I will check the contents of the retention.dat file to make sure all
services are there, but I have a vague recollection that they were
before. My memory might mislead me though.
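
A check like that could be scripted. The following is a minimal Python
sketch, assuming retention.dat uses the usual Nagios 3 layout of
"host {" and "service {" blocks holding key=value lines and closed by
"}". The function name summarize_retention is hypothetical, and the
field names should be verified against a real retention.dat first.

```python
from collections import Counter


def summarize_retention(text):
    """Count retained blocks by type and list services with notifications off.

    Assumes blocks look like 'service {' ... 'key=value' ... '}'.
    """
    counts = Counter()
    muted = []          # (host_name, service_description) pairs
    block_type = None   # type of the block currently being read
    fields = {}

    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            # Outside a block: only a block opener is interesting.
            if line.endswith("{"):
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            # Block closed: record it and reset the state.
            counts[block_type] += 1
            if (block_type == "service"
                    and fields.get("notifications_enabled") == "0"):
                muted.append((fields.get("host_name"),
                              fields.get("service_description")))
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value

    return counts, muted
```

Feeding it the contents of retention.dat right after a bad reload and
comparing counts["service"] with the expected 4654 would show whether
the services are missing from the file or are lost while it is read.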

As far as I can see, Nagios does not use the retained data when doing a
restart (kill -HUP). It keeps the status.dat file (line 839 in
nagios.c).

As for the reading of retention.dat, I don't really follow the details
of the mmap file handling. I will check how the reading is done next
week. VMware servers are, as you know, known for unreliable I/O. Maybe
the read function doesn't check for errors.

Yes, it is the service checks that sort last in alphabetical order.

What I meant by buffer overflow was that when you write commands to a
named pipe faster than the reading process can handle them, you might
lose data. But I've checked, and noticed that the retention data is
written directly into the linked list inside Nagios.

I was considering exactly what you suggest, touching a file and doing
restarts from cron, but it doesn't seem to matter whether there are many
or few restarts. For a while I had the idea that the problem occurred
when doing a reload before the startup routines were finished. But then
it happened again when doing a "single" reload, so that doesn't seem to
be the case...

I haven't been able to reproduce the problem with any reliability
either. It just happens every 10 or 20 reloads or so and, of course,
never when I want it to. The most annoying part is that it has only
happened on the production server.

/Markus

-----Original Message-----
From: Andreas Ericsson [mailto:ae at op5.se] 
Sent: 7 November 2008 15:08
To: Almroth, Markus M.
Cc: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] Nagios retention problem.


Markus.Almroth at teliasonera.com wrote:
> I run a nagios installation with 522 servers and 4654 service checks.
> 
> When adding or removing clients, it happens that about half or perhaps
> 2/3 of the service checks lose all status retention. What is more
> concerning is that they also go back to "initial state", e.g.
> Notifications are turned off!! This is bad.
> 

I'll clarify a bit here for history reasons, so that people reading the
ML archives know what's going on. I've gotten the details from our
support staff.

"Adding a client" in this case means the equivalent of running

  /etc/init.d/nagios reload

or, in plaintext, sending SIGHUP to Nagios.
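
For archive readers, a minimal Python sketch of that signal delivery
follows. It assumes the Nagios lock file holds the daemon's PID on its
first line; the helper names read_pid and send_reload are hypothetical,
and the lock file location depends on the lock_file setting of the
installation.

```python
import os
import signal


def read_pid(lock_file):
    """Return the PID stored in the lock file as an integer."""
    with open(lock_file) as fh:
        return int(fh.read().strip())


def send_reload(lock_file):
    """Send SIGHUP to the process whose PID is stored in lock_file."""
    pid = read_pid(lock_file)
    os.kill(pid, signal.SIGHUP)
    return pid
```

Unlike the init script, this sketch does not verify the configuration
before signalling, so it is only an illustration of the signal itself.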

> It doesn't happen every time, and it isn't the same servers every
> time.
> 

Does it happen with services or with hosts? If it's random, does it
happen more often with hosts/services that alphabetically sort last?

Apart from that, I'll need some more info to properly determine what's
going wrong here. What OS type/version are you using? 64 or 32-bit?
Multi-processor or single? What version of glibc are you using
(actually, what version of libpthread, but one can be inferred from the
other)?

If you're running this on VMWare on a guest OS emulating multiple CPUs,
I'm *guessing* you're running into an issue of Nagios not properly
checking for received signals before starting to write the retention
file, so the thread responsible for writing it gets killed by a signal
delivered to the controller thread. If you're running Nagios in VMWare
(a big no-no, as most know), this is more likely to happen.

You could try sending the RESTART_PROCESS command to Nagios' command
file instead, but you probably want to stagger it a bit so you don't
spam the poor FIFO in case you get lots of reload requests in a short
timeframe, for example by touching a file and then reloading once every
five minutes (from a cron job) if the file exists (make sure to remove
the file after restarting, or you'll be wasting cycles at a tremendous
rate).
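
The flag-file approach could look roughly like the Python sketch below,
meant to be run from cron every five minutes. It assumes the standard
external command format of "[epoch] COMMAND" followed by a newline; the
function name restart_if_flagged and both paths passed to it are
hypothetical and have to match the local installation.

```python
import os
import time


def restart_if_flagged(flag_file, command_file, now=None):
    """Queue one RESTART_PROCESS command if flag_file exists.

    Returns True when a restart was requested, False otherwise.
    """
    if not os.path.exists(flag_file):
        return False

    timestamp = int(time.time() if now is None else now)
    # External commands are written as "[epoch] COMMAND" plus a newline.
    # When command_file is a FIFO, this open blocks until a reader
    # (Nagios) has the other end open.
    with open(command_file, "w") as cmd:
        cmd.write(f"[{timestamp}] RESTART_PROCESS\n")

    # Remove the flag so later cron runs do nothing until a new request.
    os.remove(flag_file)
    return True
```

Whatever adds or removes a client then only touches the flag file, so
any number of requests within one interval collapses into one restart.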

Needless to say, we don't have this problem and I haven't heard from
anyone else that suffers from it either, which suggests to me that
you're doing something that isn't quite normal. Having fired up our
stress-test config (12000 hosts, 60000 services, running a plugin that
emulates extremely skittish behaviour and submitting random commands
every now and then) on one of our servers, I've failed to reproduce this
problem.

> Very strange. It seems to me like some kind of buffer overflow.

It's not a buffer overflow. A buffer overflow would have left your
system riddled with core-dumps and nagios would not have continued
running after receiving the SIGHUP.

> It started when I upgraded from 2.9 to 3.0.4.
> 

Strange. Given that there are no changes in the core between 3.0.4 and
3.0.5, I don't think it's worth upgrading to see if that solves the
problem (although you probably want to use 3.0.5, or the even more fixed
3.0.5p1 from http://www.op5.org/src/nagios-3.0.5p1.tar.gz anyway for the
security fixes they add).

If you figure out what it is, or if you can give me enough information
to reproduce it, I'll see what I can do to fix this. We're just about to
ship a release right now though, so I won't have time to do anything
about it until Monday at the earliest.

Good luck.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
