Host down, still doing active checks, causing multiple unwanted service failures
    Toussaint OTTAVI 
    t.ottavi at medi.fr
       
    Mon Dec  8 18:38:02 CET 2008
    
    
  
Hi list,
I've been investigating this problem for a while, but I couldn't find a 
good solution.
* Example situation:
Assume I have one host with 20 service checks.
* Problem:
If the host goes DOWN, Nagios still continues to run the service checks 
on this host. So, after a while, all the services go to a CRITICAL 
state. Then, in my console, I see:
  - 1 host down,
  - 20 services down
This information is not relevant. The only thing I want to see in such 
a case is the "host down". The 20 "service down" entries merely follow 
from it, and they generate "visual pollution" that can make the real 
problem harder to identify.
* Expected behavior:
When a host is down, I would like to:
- See only one thing in red in the console: 1 HOST DOWN
- Disable all the service checks (which at this point have no chance 
of success)
- Put the services into "UNKNOWN" status
Comments:
In Nagios, there are parent/child dependencies. When a host is down, its 
child hosts are not tested, and their status becomes "UNREACHABLE". 
Good thing. Same thing for services. But, as far as I know, there are no 
dependencies between a host and its services. I googled and read a lot 
of the docs. This seems to be "by design": there's no way to declare a 
service as a child of its (parent) host! I don't really understand the 
reasons for this choice, but I would like to work around it.
Then I played around with event handlers. When a host's status changes, 
the event handler calls a script. The script checks the status of the 
"calling" host. If the host is DOWN or UNREACHABLE, it sends Nagios an 
"external command" to disable all active service checks for that host. 
If the host is UP, it sends the external command to enable all service 
checks for that host. It works. But there is some "latency" between the 
time the services are disabled by the event handler and the time Nagios 
stops running the service checks. Usually, some services are still 
checked, and report an unwanted "FAILED" status. I think this is because 
these checks were queued before the handler disabled them, so they are 
still executed. So I'm not 100% satisfied.
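To make the idea concrete, here is a minimal sketch of such a handler in 
Python. It is an illustration, not the actual script described above. It 
assumes the handler is called with the host state and host name as 
arguments (for example via the $HOSTSTATE$ and $HOSTNAME$ macros), and 
the command file path is an assumption that depends on the installation. 
The external commands used are DISABLE_HOST_SVC_CHECKS and 
ENABLE_HOST_SVC_CHECKS.

```python
import sys
import time

# Assumed location of the external command file; it varies per install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host_state, host_name, now=None):
    """Return the external-command line matching the given host state."""
    timestamp = int(time.time() if now is None else now)
    if host_state in ("DOWN", "UNREACHABLE"):
        action = "DISABLE_HOST_SVC_CHECKS"
    elif host_state == "UP":
        action = "ENABLE_HOST_SVC_CHECKS"
    else:
        raise ValueError("unexpected host state: %r" % host_state)
    # External command format: [timestamp] COMMAND;argument
    return "[%d] %s;%s\n" % (timestamp, action, host_name)


def send(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Run only when called as: handler.py <host_state> <host_name>
if __name__ == "__main__" and len(sys.argv) == 3:
    send(build_command(sys.argv[1], sys.argv[2]))
```

Such a sketch does nothing about the latency: checks that were already 
queued when the command is processed can still run and report a result.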
The next step would be to use service event handlers to put every 
service into "UNKNOWN" status each time a service check is disabled. But 
I have two problems:
- In my external script, I cannot determine whether a service check is 
ENABLED or DISABLED. There are a lot of "macros" available, but none of 
them gives me this information.
- This may not solve the "latency" problem: if I manually set an 
"UNKNOWN" status on a DISABLED service while an active check is already 
in the queue, its result will arrive later and overwrite the status.
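As a rough sketch of what "setting UNKNOWN manually" could look like, the 
Python snippet below builds one passive check result per service using 
the PROCESS_SERVICE_CHECK_RESULT external command with return code 3 
(UNKNOWN). The host name, the list of service descriptions and the 
output text are hypothetical inputs that the script would have to 
obtain some other way, and Nagios only takes such results into account 
if passive checks are accepted for those services. The lines would be 
written to the command file like any other external command.

```python
import time

UNKNOWN = 3  # standard plugin return code for the UNKNOWN state


def unknown_results(host_name, service_descriptions, now=None,
                    output="Host is down, service check skipped"):
    """Build one passive UNKNOWN result line per service of the host."""
    timestamp = int(time.time() if now is None else now)
    # Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    return [
        "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n"
        % (timestamp, host_name, description, UNKNOWN, output)
        for description in service_descriptions
    ]


if __name__ == "__main__":
    # Hypothetical host and services, printed instead of sent to Nagios.
    for line in unknown_results("web01", ["HTTP", "SSH"]):
        print(line, end="")
```

This does not remove the second problem above: a result from an active 
check that was already queued can still arrive after the UNKNOWN result 
and replace it.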
Of course, the ideal situation would be to have a parent/child 
dependency acting between hosts and services...
Any comments and suggestions are welcome. Thank you in advance for your 
help.
Kind regards
-- 
*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* t.ottavi at medi.fr