Some hard state changes missing in NDOUtils

Ton Voon ton.voon at altinity.com
Tue Nov 13 18:32:36 CET 2007


Andreas,

On 13 Nov 2007, at 16:13, Andreas Ericsson wrote:
> It would be best to follow the path of the least surprise. In this case,
> I think this simple rule describes that path rather well:
> "Whenever a state change occurs, the check attempt value is reset."

I think you are confusing this with a state change from OK to a
failed state. As I said, this is a state change from a failed state
to a different failed state at the same time that a host is
discovered to be down/unreachable.

http://nagios.sourceforge.net/docs/2_0/statetypes.html

...says that a hard state change occurs when a service goes from a
"hard non-OK state of some kind to a hard non-OK state of another
kind (i.e. from a hard WARNING state to a hard UNKNOWN state)". I
assert that this does not happen in this particular case.

There is nothing on that documentation page about check attempt  
values between hard states, but I also suggest that if something is  
on check attempt 3 out of max attempts 3 in a warning state, then if  
the next result is critical, the check attempt should remain at 3.  
I've just tested this using passive checks and this appears to be  
true, so your assumption that "whenever a state change occurs, the  
check attempt value is reset" is not current Nagios behaviour.
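
To make the behaviour I tested concrete, here is a minimal sketch in C.
It is a toy model and not the Nagios source: every name in it
(toy_service, toy_apply_result and so on) is hypothetical, and it only
encodes the counter behaviour described above, where a change from one
hard non-OK state to another leaves the check attempt at its maximum.

```c
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not Nagios identifiers. */
enum toy_state { TOY_OK, TOY_WARNING, TOY_CRITICAL, TOY_UNKNOWN };

struct toy_service {
    enum toy_state state;   /* last check result */
    bool hard;              /* true once the state is a hard state */
    int current_attempt;    /* 1..max_attempts */
    int max_attempts;
};

/* Apply one check result. Returns true when a hard state change occurs. */
bool toy_apply_result(struct toy_service *s, enum toy_state result)
{
    enum toy_state prev = s->state;
    bool was_hard = s->hard;
    bool hard_change;

    if (result == TOY_OK) {
        /* Recovery: the counter goes back to 1. */
        hard_change = (prev != TOY_OK) && was_hard;
        s->current_attempt = 1;
        s->hard = true;
    } else if (prev == TOY_OK) {
        /* First failure after OK: start counting from 1. */
        s->current_attempt = 1;
        s->hard = (s->max_attempts <= 1);
        hard_change = s->hard;
    } else if (!was_hard) {
        /* Soft failure: retry until max_attempts is reached. */
        if (s->current_attempt < s->max_attempts)
            s->current_attempt++;
        s->hard = (s->current_attempt >= s->max_attempts);
        hard_change = s->hard;
    } else {
        /* Hard failure to another hard failure: the counter is NOT reset. */
        hard_change = (result != prev);
    }
    s->state = result;
    return hard_change;
}
```

Starting from a service at hard WARNING 3/3 and applying a CRITICAL
result, this model reports a hard state change and the attempt stays
at 3, which matches what I saw with passive checks.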


> Besides that, I'm curious to know how this changes notification behaviour.

Notification logic has not been touched. The fix for the hard state
change is the call to handle_service_event() (which in turn calls
event handlers as well as updating NDO).

Normally, every call to log_service_event() (which puts the  
nagios.log entry in) is followed by a handle_service_event(). But  
this is missing in this scenario, which is how I reached this  
conclusion.
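
The pairing can be sketched as follows. This is a hedged illustration
in C with stand-in stubs: toy_log_service_event() and
toy_handle_service_event() are hypothetical placeholders that only
count calls, and they do not reproduce the signatures or bodies of the
real log_service_event() and handle_service_event().

```c
/* Toy stand-ins: hypothetical stubs that only count how often they run. */
struct toy_counters {
    int log_entries;     /* entries that would reach nagios.log */
    int handled_events;  /* event handler runs / NDO updates */
};

static void toy_log_service_event(struct toy_counters *c)
{
    c->log_entries++;
}

static void toy_handle_service_event(struct toy_counters *c)
{
    c->handled_events++;
}

/* The scenario as described: the state is logged but never handled. */
void toy_host_down_path_before(struct toy_counters *c)
{
    toy_log_service_event(c);
}

/* With the fix: every log call is followed by a handle call. */
void toy_host_down_path_after(struct toy_counters *c)
{
    toy_log_service_event(c);
    toy_handle_service_event(c);
}
```

In the "before" path the log counter moves ahead of the handled
counter, which is the same mismatch as a nagios.log entry with no
matching record in NDO.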

> On a side note, I'm a little unclear about what you're actually reporting
> as the bug. The fact that obj->current_attempt is reset, or the fact that
> state entries are missing from the NDOUtils table. The report seems to
> imply both, while common sense suggests the latter and the patch amends
> the former.

Apologies. Looking back, my summary alluded to two bugs, but I failed  
to fully detail them. For the record there are two bugs:

   1) check_attempts is reset incorrectly when a service is currently
      in a HARD state and the host has just failed
   2) the event handlers are not called in this scenario, thus a
      record is not propagated to NDO

Thinking about (1) a bit more, it is possible that a service in a  
soft error state would show a hard error with check attempts 1/3,  
which is counter-intuitive. However, this is the case mentioned in  
http://nagios.sourceforge.net/docs/2_0/statetypes.html where "When a  
service check results in a non-OK state and its corresponding host is  
either DOWN or UNREACHABLE [causes a hard state change]. This is an  
exception to the general monitoring logic, but makes perfect sense.  
If the host isn't up why should we try and recheck the service?".
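
As a rough sketch of that documented exception, again in C and again
with hypothetical names that are not taken from the Nagios source, the
decision for a non-OK result could be written like this:

```c
#include <stdbool.h>

/* Hypothetical names for illustration only. */
enum toy_host_state { TOY_HOST_UP, TOY_HOST_DOWN, TOY_HOST_UNREACHABLE };

/*
 * For a non-OK service result, decide whether it counts as a hard state.
 * If the host is DOWN or UNREACHABLE the result is hard immediately,
 * whatever the attempt counter says; otherwise it is hard only once
 * max_attempts has been reached.
 */
bool toy_non_ok_result_is_hard(enum toy_host_state host,
                               int current_attempt, int max_attempts)
{
    if (host != TOY_HOST_UP)
        return true;
    return current_attempt >= max_attempts;
}
```

With the host down, a service on attempt 1 of 3 is already treated as
hard here, which is the counter-intuitive "hard error at 1/3" display
mentioned above.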

However, I can see that there may still be problems with my fix to  
(1), so peer review is welcome.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon


