freshness check bug?

Bryan Loniewski brylon at jla.rutgers.edu
Wed May 11 20:21:09 CEST 2005


I'm using NSCA, but as I mentioned in my original post, I turned OFF receiving any
packets so I could check the behavior of freshness checking (i.e., my service is not
getting "fresh" results, which is what I want, since I'm testing what happens when this
scenario exists for real later on) ;)
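
For what it's worth, the "stale by" figures in the log excerpts quoted below are at least consistent with simple arithmetic. The following is only a sketch of that arithmetic, not Nagios's actual code: the function name `stale_by` is made up for illustration, and it assumes that, with no passive result ever received, staleness is measured from the time the daemon started.

```python
def stale_by(now, reference, threshold):
    """Seconds past the freshness deadline, or 0 if the result is still fresh.

    Assumption (not taken from the Nagios source): the deadline is simply
    reference + threshold, where reference is the daemon start time when no
    check result has been received yet.
    """
    return max(0, now - (reference + threshold))


# (time of warning, time of "Finished daemonizing", freshness_threshold)
# All values are copied from the nagios.log excerpts in this thread.
cases = [
    (1115822828, 1115822708, 60),   # original test, threshold=60
    (1115831272, 1115831032, 180),  # later test, threshold=180
]

for now, started, threshold in cases:
    print(threshold, stale_by(now, started, threshold))
```

Both cases come out to 60 seconds, which matches what the warnings report, so the warnings themselves look sane; the open question is why the checks stop recurring.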

_________________________
Bryan Loniewski
Rutgers University
NBCS - Systems Programmer

On Wed, 11 May 2005, admin at jpk236.com wrote:

> Bryan,
> 	You never mentioned, and I forgot to ask.  What method are you using 
> to send the passive checks from the distributed monitored servers to your 
> central server?  NSCA?  If so, are those servers configured correctly to send 
> the data?  Is the central server configured correctly to receive the data?
>
> - Justin Kulikowski
> 	[ http://www.jpk236.com ]
>
> Bryan Loniewski wrote:
>> Regardless of what freshness_threshold I pick (as long as it's not too
>> unrealistic), I just want clarification on whether a bug exists. (By the way,
>> where do you see that the default freshness threshold is 300 sec?) Anyway, I
>> increased the threshold just now to 180 seconds and the only thing in my
>> nagios.log was:
>> 
>> [1115831032] Finished daemonizing... (New PID=16154)
>> [1115831272] Warning: The results of service 'PROCS-NAGIOS' on host 'csstest2' are stale by 60 seconds (threshold=180 seconds).  I'm forcing an immediate check of the service.
>> 
>> So it did not even execute my eventhandler once? I'm getting very
>> inconsistent results!
>> 
>> NRPE and check_by_ssh are not acceptable methods for distributed monitoring
>> in our environment.
>> 
>> Thanks for the comments... Justin
>> 
>> _________________________
>> Bryan Loniewski
>> Rutgers University
>> NBCS - Systems Programmer
>> 
>> On Wed, 11 May 2005, admin at jpk236.com wrote:
>> 
>>> Bryan,
>>>     A freshness_threshold of 60 seconds might be a little unrealistic.  The
>>> default value for the threshold is 300 seconds (5 minutes).
>>>     If you want almost real-time stats, which appears to be what you're
>>> going for, perhaps you want to try NRPE or check_by_ssh as an alternative
>>> method of doing distributed monitoring.
>>> 
>>> - Justin Kulikowski
>>>     [ http://www.jpk236.com ]
>>> 
>>> Bryan Loniewski wrote:
>>> 
>>>> While trying to set up failover in a distributed environment, I came
>>>> across the following problem (bug?) involving freshness checking.
>>>> 
>>>> Note: The host that this is set up on is NOT receiving any passive checks
>>>> while I am testing the freshness checking, so the results are always
>>>> stale, forcing the freshness check every time.
>>>> 
>>>> Note2: Relevant config snippets are under my .sig
>>>> 
>>>> Configuring (passive) service freshness checking to execute an
>>>> eventhandler works correctly for 1 or 2 iterations, BUT no more than that.
>>>> It seems to stop checking the freshness after at most 3 iterations and
>>>> stops executing the eventhandler after at most 2 iterations. I've
>>>> replicated this behavior (too) many times and the results are
>>>> inconsistent.
>>>> 
>>>> Below is the output of my nagios log:
>>>> 
>>>> <snip nagios.log>
>>>> [1115822708] Finished daemonizing... (New PID=15941)
>>>> [1115822828] Warning: The results of service 'PROCS-NAGIOS' on host 'csstest2' are stale by 60 seconds (threshold=60 seconds).  I'm forcing an immediate check of the service.
>>>> [1115822838] SERVICE ALERT: csstest2;PROCS-NAGIOS;CRITICAL;SOFT;1;CRITICAL
>>>> [1115822838] SERVICE EVENT HANDLER: csstest2;PROCS-NAGIOS;CRITICAL;SOFT;1;slave-failover
>>>> [1115822948] Warning: The results of service 'PROCS-NAGIOS' on host 'csstest2' are stale by 60 seconds (threshold=60 seconds).  I'm forcing an immediate check of the service.
>>>> 
>>>> Notice the freshness check ran ONLY 2 times when it should have run 5 (if
>>>> you look at my config options below) and the eventhandler ran ONLY 1 time,
>>>> when it should have run 3 times.
>>>> 
>>>> Can anyone verify (disprove) this behavior? Am I missing something?
>>>> 
>>>> _________________________
>>>> Bryan Loniewski
>>>> Rutgers University
>>>> NBCS - Systems Programmer
>>>> 
>>>> <snip nagios.cfg>
>>>> check_service_freshness=1
>>>> service_freshness_check_interval=60
>>>> <snip>
>>>> 
>>>> <snip objects.cfg>
>>>> define service{
>>>>          name                            generic-service
>>>>          parallelize_check               1
>>>>          obsess_over_service             1
>>>>          check_freshness                 0
>>>>          freshness_threshold             60
>>>>          notifications_enabled           1
>>>>          event_handler_enabled           1
>>>>          flap_detection_enabled          1
>>>>          failure_prediction_enabled      1
>>>>          process_perf_data               1
>>>>          retain_status_information       1
>>>>          retain_nonstatus_information    1
>>>>          is_volatile                     0
>>>>          max_check_attempts              5
>>>>          normal_check_interval           2
>>>>          retry_check_interval            1
>>>>          check_period                    24x7
>>>>          contact_groups                  super-admins
>>>>          notification_interval           3
>>>>          notification_period             24x7
>>>>          register                        0
>>>> }
>>>> define service{
>>>>          use                             generic-service
>>>>          name                            generic-passive-service
>>>>          active_checks_enabled           0
>>>>          passive_checks_enabled          1
>>>>          register                        0
>>>> }
>>>> define service{
>>>>          use                             generic-passive-service
>>>>          host_name                       csstest2
>>>>          service_description             PROCS-NAGIOS
>>>>          check_freshness                 1
>>>>          freshness_threshold             60
>>>>          check_command                   check_dummy!2
>>>>          event_handler                   slave-failover
>>>> }
>>>> define command{
>>>>         command_name    check_dummy
>>>>         command_line    $USER1$/check_dummy $ARG1$
>>>> }
>>>> define command{
>>>>         command_name    slave-failover
>>>>         command_line    $USER2$/failover $SERVICESTATE$ $SERVICESTATETYPE$
>>>> }
>>>> <snip>
>>>> 
>>>> 
>>>> -------------------------------------------------------
>>>> This SF.Net email is sponsored by Oracle Space Sweepstakes
>>>> Want to be the first software developer in space?
>>>> Enter now for the Oracle Space Sweepstakes!
>>>> http://ads.osdn.com/?ad_id=7393&alloc_id=16281&op=click
>>>> _______________________________________________
>>>> Nagios-devel mailing list
>>>> Nagios-devel at lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>> 
>>> 
>
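
To make the timing in the quoted nagios.log from the original post easier to read, here is a small sketch that converts the epoch timestamps to offsets from daemon start and counts the relevant events. The log lines are copied from the post (with the mail wrapping undone); the helper names `parse`, `offsets` and `count` are just illustrative.

```python
import re

# Log excerpt from the original post, with wrapped lines rejoined.
LOG = """\
[1115822708] Finished daemonizing... (New PID=15941)
[1115822828] Warning: The results of service 'PROCS-NAGIOS' on host 'csstest2' are stale by 60 seconds (threshold=60 seconds).  I'm forcing an immediate check of the service.
[1115822838] SERVICE ALERT: csstest2;PROCS-NAGIOS;CRITICAL;SOFT;1;CRITICAL
[1115822838] SERVICE EVENT HANDLER: csstest2;PROCS-NAGIOS;CRITICAL;SOFT;1;slave-failover
[1115822948] Warning: The results of service 'PROCS-NAGIOS' on host 'csstest2' are stale by 60 seconds (threshold=60 seconds).  I'm forcing an immediate check of the service.
"""


def parse(log):
    """Return a list of (epoch_seconds, message) tuples."""
    entries = []
    for line in log.splitlines():
        m = re.match(r"\[(\d+)\] (.*)", line)
        if m:
            entries.append((int(m.group(1)), m.group(2)))
    return entries


def offsets(entries):
    """Re-express each timestamp as seconds since the first entry."""
    start = entries[0][0]
    return [(ts - start, msg) for ts, msg in entries]


def count(entries, marker):
    """Count entries whose message contains the given marker text."""
    return sum(1 for _, msg in entries if marker in msg)


entries = parse(LOG)
for off, msg in offsets(entries):
    print(f"+{off:>4}s  {msg[:45]}")
print("stale warnings:    ", count(entries, "are stale"))
print("event handler runs:", count(entries, "SERVICE EVENT HANDLER"))
```

Run against the excerpt, this shows the two stale warnings 120 seconds apart and a single event handler run, which is the behavior described in the original post.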

