Restarts resetting soft critical states

Frost, Mark {PBG} mark.frost1 at pepsi.com
Wed Nov 4 04:10:23 CET 2009



>From: Martin Melin [mailto:mmelin at gmail.com]
>Sent: Tuesday, November 03, 2009 4:05 PM
>To: nagios-users at lists.sourceforge.net
>Subject: Re: [Nagios-users] Restarts resetting soft critical states
>
>> On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <mark.frost1 at pepsi.com> wrote:
>>
>>
>>-----Original Message-----
>>>From: Andreas Ericsson [mailto:ae at op5.se]
>>>Sent: Monday, November 02, 2009 7:02 AM
>>>To: Frost, Mark {PBG}
>>>Cc: nagios-users at lists.sourceforge.net
>>>Subject: Re: [Nagios-users] Restarts resetting soft critical states
>>>
>>>> On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote:
>>>>
>>>>  Both the reporting server and the distributed node share the same
>>>> attributes for retention and soft states:
>>>>
>>>> soft_state_dependencies=0
>>>> passive_host_checks_are_soft=1
>>>> retain_state_information=1
>>>> use_retained_program_state=1
>>>> use_retained_scheduling_info=1
>>>> retained_host_attribute_mask=0
>>>> retained_service_attribute_mask=0
>>>> retained_process_host_attribute_mask=0
>>>> retained_process_service_attribute_mask=0
>>>> retained_contact_host_attribute_mask=0
>>>> retained_contact_service_attribute_mask=0
>>>>
>>>> While I would assume the restarts would disrupt Nagios a bit what with
>>>> having to do start-time tasks again, I would not have expected that it
>>>>  would "start over" with the status of some checks.
>>>>
>>>> What am I missing here?
>>>
>>>
>>> It seems you haven't grasped how bitmasks work. When you set the mask to 0,
>>> you essentially tell it to not let anything through. Set them to -1, or
>>> leave them at the default values and you'll get the kind of state retention
>>> you want.
>>
>> Thanks, Andreas.  Unfortunately, I'm still puzzled.  The mask values you refer to are
>> already set to the defaults (they're all 0's).  I've never touched those or paid much
>> attention to them until now.
>>
>> I'm actually confused by 2 aspects of this.  It seems to me that what I'm trying to
>> retain across a restart is the soft check states (those are what are being reset).
>> Looking at the MODATTR arguments in include/common.h (3.0.6) I don't see which of those
>> attributes would govern this.  There are the *ENABLED attributes, which really aren't
>> changing here (and are retained).  All the other MODATTRs are (it seems to me) not
>> changing in this case either.
>>
>> The second thing that confuses me here is the verbiage used to describe the mask
>> functionality:
>>
>>       # RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE)
>>       # The following variables are used to specify specific host and
>>       # service attributes that should *not* be retained by Nagios during
>>       # program restarts.
>>
>> So if MODATTR is set to none, based on the comment doesn't this mean that "NONE" of the
>> attributes are NOT retained?  I.e. all are retained (double-negative)?  The on-line doc
>> for these masks says "By default, all host and service attributes are retained."
>>
>>
> I don't know the source code behavior, but I agree with this and a default nagios.cfg has
> all of the masks set to zero, presumably to not mask anything, i.e. to not affect what's
> retained.
>>
>>
>> I do get masks, I just didn't see how these applied here.
>>
>> Your help is greatly appreciated.
>>
> I just did a quick experiment with the default values for *retain* variables in
> nagios.cfg - which are exactly what you quote:
>
> [1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios
> [1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios
> [1257281604] Caught SIGTERM, shutting down...
> [1257281604] Successfully shutdown... (PID=9617)
> [1257281605] Nagios 3.0.6 starting... (PID=9721)
> [1257281605] Local time is Tue Nov 03 21:53:25 CET 2009
> [1257281605] LOG VERSION: 2.0
> [1257281605] Finished daemonizing... (New PID=9722)
> [1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios
>
> Everything works as expected.
>
> I'm guessing you have some other issue that's affecting Nagios' ability to save retention data.
>
> What's the value of state_retention_file and retention_update_interval for you?
>
> Have you checked that state_retention_file is updated when Nagios runs, that you're not close to capacity of the disk or that something basic like that is going on?
>
> Open up the file and grab the definition for the service in question, see what values are being saved.
>
> HTH,
>
> Regards,
> Martin Melin

Martin,

state_retention_file=/usr/local/eam/nagios/var/retention.dat
retention_update_interval=60

I see that retention.dat was updated when I restarted Nagios maybe 20 minutes ago.  I just tested disabling notifications for a check, but I guess based on my retention_update_interval I won't see the retention.dat file change for another 40 minutes.
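
One way to confirm the retention file is being refreshed on schedule is to compare its modification time with retention_update_interval. Below is a minimal Python sketch, assuming the interval is in minutes. The function names and the grace_minutes allowance are hypothetical and not part of Nagios.

```python
import os
import time


def retention_age_minutes(path, now=None):
    """Return how many minutes ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def looks_stale(age_minutes, update_interval_minutes, grace_minutes=5):
    """True if the file is older than one update interval plus some slack.

    grace_minutes is an arbitrary allowance for illustration, not a Nagios setting.
    """
    return age_minutes > update_interval_minutes + grace_minutes


if __name__ == "__main__":
    # Path and interval taken from the settings quoted above.
    path = "/usr/local/eam/nagios/var/retention.dat"
    interval = 60
    if os.path.exists(path):
        age = retention_age_minutes(path)
        print("age %.1f min, stale: %s" % (age, looks_stale(age, interval)))
```

With a 60-minute interval, a file written 20 minutes ago would not be flagged; one that has gone well past the interval without a restart in between would be.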

Nagios monitors the filesystem itself (i.e. Nagios watches itself), and the filesystem it resides on is at 35% with 12GB free.  If there were a problem with that or some other essential operation of Nagios, I think I'd see some other problem.  In this case, I think an unusual set of circumstances was at play -- I was restarting Nagios every few minutes while a host was in the process of failing host checks as reported by the distributed nodes.  I've never seen that before, but I've also probably never happened to restart that way either.

I looked at this item in retention.dat (it's a host check that we had this issue with, not a service check).  This might not be all that useful as the issue isn't happening at the moment.  At present, I see the following of interest:

last_state=0
last_hard_state=0
current_attempt=1
max_attempts=10
state_history=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

I assume that "last_state" might mean the last soft state.  It would be interesting to see this value if I could find a practical way to replicate this condition.  While the issue was happening, I would also expect current_attempt to be higher than 1 and the state_history to show some non-OK states.  As I say, I'd have to see these values while this was changing.
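
To catch these values while the problem is in progress, the host's entry could be pulled out of retention.dat right before each restart. The sketch below is a rough Python illustration, assuming the file is made of "type {" ... "}" blocks containing key=value lines, and that host entries carry a host_name key. The function names are hypothetical, and the parser is not taken from Nagios itself.

```python
# Fields of interest, as listed above.
KEYS = ("last_state", "last_hard_state", "current_attempt",
        "max_attempts", "state_history")


def parse_retention_blocks(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block.

    Assumes one key=value pair per line and a lone '}' closing each block.
    """
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if block_type is None:
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line.partition("=")
            fields[key] = value


def host_state(text, host_name):
    """Return the fields in KEYS for the named host, or None if not found."""
    for block_type, fields in parse_retention_blocks(text):
        if block_type == "host" and fields.get("host_name") == host_name:
            return {key: fields.get(key) for key in KEYS}
    return None
```

Calling host_state(open(path).read(), "somehost") before and after a restart would show whether current_attempt and state_history survive; "somehost" is a placeholder for the affected host.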

Thanks

Mark





