Hi Mark,

Sorry for the late reply; I was caught up with other things.

Your retention settings look fine. It would be interesting if you could recreate the problem, but my guess is that it requires the particular circumstances of your production environment, so trying may not be prudent.

Nagios writes retention.dat on every clean shutdown and every retention_update_interval minutes (60 in your configuration). Your logs look like Nagios did a clean shutdown, but perhaps for some reason it didn't write retention.dat.

In any case, I can't really see anything wrong here, but if I have time I'm going to see if I can replicate the behavior you experienced.

Best regards,
Martin Melin

<div class="gmail_quote">On Wed, Nov 4, 2009 at 4:10 AM, Frost, Mark {PBG} <<a href="mailto:mark.frost1@pepsi.com">mark.frost1@pepsi.com</a>> wrote:
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
>From: Martin Melin [mailto:<a href="mailto:mmelin@gmail.com">mmelin@gmail.com</a>]
>Sent: Tuesday, November 03, 2009 4:05 PM
>To: <a href="mailto:nagios-users@lists.sourceforge.net">nagios-users@lists.sourceforge.net</a>
<div class="im">>Subject: Re: [Nagios-users] Restarts resetting soft critical states
>
>> On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <<a href="mailto:mark.frost1@pepsi.com">mark.frost1@pepsi.com</a>> wrote:
>>
>>
>>-----Original Message-----
>>>From: Andreas Ericsson [mailto:<a href="mailto:ae@op5.se">ae@op5.se</a>]
>>>Sent: Monday, November 02, 2009 7:02 AM
>>>To: Frost, Mark {PBG}
>>>Cc: <a href="mailto:nagios-users@lists.sourceforge.net">nagios-users@lists.sourceforge.net</a>
>>>Subject: Re: [Nagios-users] Restarts resetting soft critical states
>>>
>>>> On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote:
>>>>
</div><div><div></div><div class="h5">>>>> Both the reporting server and the distributed node share the same
>>>> attributes for retention and soft states:
>>>>
>>>> soft_state_dependencies=0
>>>> passive_host_checks_are_soft=1
>>>> retain_state_information=1
>>>> use_retained_program_state=1
>>>> use_retained_scheduling_info=1
>>>> retained_host_attribute_mask=0
>>>> retained_service_attribute_mask=0
>>>> retained_process_host_attribute_mask=0
>>>> retained_process_service_attribute_mask=0
>>>> retained_contact_host_attribute_mask=0
>>>> retained_contact_service_attribute_mask=0
>>>>
>>>> While I would assume the restarts would disrupt Nagios a bit what with
>>>> having to do start-time tasks again, I would not have expected that it
>>>> would "start over" with the status of some checks.
>>>>
>>>> What am I missing here?
>>>
>>>
>>> It seems you haven't grasped how bitmasks work. When you set the mask to 0,
>>> you essentially tell it to not let anything through. Set them to -1, or
>>> leave them at the default values and you'll get the kind of state retention
>>> you want.
>>
>> Thanks, Andreas. Unfortunately, I'm still puzzled. The mask values you refer to are
>> already set to the defaults (they're all 0's). I've never touched those or paid much
>> attention to them until now.
>>
>> I'm actually confused by 2 aspects of this. It seems to me that the thing I'm trying to
>> retain across a restart are soft check states (those are what are being reset). Looking
>> at the MODATTR arguments in include/common.h (3.0.6) I don't see which of those
>> attributes would govern this. There's the *ENABLED attributes which really aren't
>> changing here (and are retained). All the other MODATTR's are (it seems to me) not
>> changing in this case either.
>>
>> The second thing that confuses me here is the verbiage used to describe the mask
>> functionality:
>>
>> # RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE)
>> # The following variables are used to specify specific host and
>> # service attributes that should *not* be retained by Nagios during
>> # program restarts.
>>
>> So if MODATTR is set to none, based on the comment doesn't this mean that "NONE" of the
>> attributes are NOT retained? I.e. all are retained (double-negative)? The on-line doc
>> for these masks says "By default, all host and service attributes are retained."
>>
>>
> I don't know the source code behavior, but I agree with this, and a default nagios.cfg has
> all of the masks set to zero, presumably to not mask anything, i.e. to not affect what's
> retained.
>>
>>
>> I do get masks, I just didn't see how these applied here.
>>
>> Your help is greatly appreciated.
>>
> I just did a quick experiment with the default values for *retain* variables in
> nagios.cfg - which are exactly what you quote:
>
> [1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios
> [1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios
> [1257281604] Caught SIGTERM, shutting down...
> [1257281604] Successfully shutdown... (PID=9617)
> [1257281605] Nagios 3.0.6 starting... (PID=9721)
> [1257281605] Local time is Tue Nov 03 21:53:25 CET 2009
> [1257281605] LOG VERSION: 2.0
> [1257281605] Finished daemonizing... (New PID=9722)
> [1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios
>
> Everything works as expected.
>
> I'm guessing you have some other issue that's affecting Nagios' ability to save retention data.
>
> What's the value of state_retention_file and retention_update_interval for you?
>
> Have you checked that state_retention_file is updated when Nagios runs, that you're not close to capacity of the disk or that something basic like that is going on?
>
> Open up the file and grab the definition for the service in question, see what values are being saved.
>
> HTH,
>
> Regards,
> Martin Melin
</div></div>Martin,

state_retention_file=/usr/local/eam/nagios/var/retention.dat
retention_update_interval=60

I see that retention.dat was updated when I restarted Nagios maybe 20 minutes ago. I just tested disabling notifications for a check, but I guess based on my retention_update_interval I won't see the retention.dat file change for another 40 minutes.

Nagios monitors the filesystem itself (i.e. Nagios watches itself), but the filesystem it resides on is at 35% with 12GB free. If there were a problem with that or some other essential operation of Nagios, I think I'd see some other problem.

In this case, I think an unusual set of circumstances was at play: I was restarting Nagios every few minutes while a host was in the process of failing host checks as reported by the distributed nodes. I've never seen that before, but I've also probably never happened to do it that way either.

I looked at this item in retention.dat (it's a host check that we had this issue with, not a service check). This might not be all that useful as this issue isn't happening at the moment. At present, I see the following of interest:

last_state=0
last_hard_state=0
current_attempt=1
max_attempts=10
state_history=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I assume that "last_state" might mean the last soft state. It would be interesting to see this value if I could find a practical way to replicate this condition. I would also expect current_attempt to be higher than 1 and the state_history to show some non-OK states while this issue was happening. As I say, I'd have to see these values while this was changing.

Thanks

Mark
</blockquote></div>
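
If you do manage to reproduce it, something like the rough Python sketch below could pull the fields of interest out of retention.dat just before each restart. It assumes the file is laid out as `host {` / `key=value` lines / `}` blocks, which you should check against your own file. The function names and the `example-host` sample are made up for illustration; the field names are the ones from your excerpt.

```python
# Fields taken from the retention.dat excerpt quoted in this thread.
FIELDS_OF_INTEREST = ("last_state", "last_hard_state",
                      "current_attempt", "max_attempts")


def parse_retention(text):
    """Parse retention.dat-style text into (block_type, fields) pairs.

    Assumed layout (verify against a real file): a line 'type {',
    then 'key=value' lines, then a closing '}' line.
    """
    blocks = []
    block_type, fields = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if fields is None:
            if line.endswith("{"):
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            blocks.append((block_type, fields))
            block_type, fields = None, None
        elif "=" in line:
            # Split on the first '=' only; values may contain '='.
            key, _, value = line.partition("=")
            fields[key] = value
    return blocks


def host_state(blocks, host_name):
    """Return the fields of interest for one host, or None if absent."""
    for block_type, fields in blocks:
        if block_type == "host" and fields.get("host_name") == host_name:
            return {name: fields.get(name) for name in FIELDS_OF_INTEREST}
    return None


# Hypothetical sample data, not copied from a real retention.dat.
SAMPLE = """\
host {
host_name=example-host
last_state=0
last_hard_state=0
current_attempt=1
max_attempts=10
}
"""

if __name__ == "__main__":
    print(host_state(parse_retention(SAMPLE), "example-host"))
```

To use it on the real file, read the path from your state_retention_file setting and pass its contents to `parse_retention`. Running it right before each restart would show whether current_attempt had actually been saved with a value above 1.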