Host down, still doing active checks, causing multiple unwanted service failures

Toussaint OTTAVI t.ottavi at medi.fr
Tue Dec 9 12:35:15 CET 2008


Hi Marc, thank you for your answer,

Marc Powell wrote:
> Nagios is first and foremost a service monitor, not a host monitor.  
> Host monitoring is only necessary, as far as nagios is concerned, for  
> two reasons --
> 	- notification suppression. If the host is down, don't notify about  
> the services. They're still down so show them down, but don't wake  
> anybody up over it if they're not also responsible for the host.
> 	- parenting/unreachable logic.
>   

I agree with you. Parenting / unreachable logic is a very good thing. 
But I think it should be possible to declare a service as a child of 
its host. This parent/child logic can suppress notifications. I think it 
could also suppress the display of inaccurate status on the console.

We do not use email notifications, because there are only two of us and 
they would generate too many messages. Instead, we periodically check 
the web console, and on all our PCs we use small plugins for Firefox and 
Windows that display the list of errors/warnings in a small popup. When 
a host is down, we get pages of errors, one for every service on it, 
when we would like to see just one. It would be useful for us if the 
parent/child notification suppression mechanism could also suppress 
these unwanted displays.

> Nagios is designed to show the current state of services as accurately  
> as possible. This helps explain the 'why' of the behavior you are  
> seeing and works very well to cover the edge cases that your goal  
> won't catch. For example, if your host check is a ping and something  
> borks ICMP on your network, you would have all the services on that  
> host disabled and set to unknown, even though they are working just  
> fine. 

That's not what happens here. Most of the monitored hosts are located 
across WAN links. These links, at least those from my office, are used 
only for remote control and remote administration, so they are built 
with cheap technologies and are not intended to be highly reliable. When 
a host stops answering pings, it usually means the WAN link is down. The 
fix is usually to reboot a router or reset a VPN tunnel. In the 
meantime, it makes no sense to send hundreds of checks over this WAN, 
because they will all fail. Nor do I need to be told that the services 
are in a failed state: they may be working fine, but the service checks 
have no chance of succeeding while the WAN is down. What I would expect 
as the service status is therefore "UNKNOWN", just as a child host 
becomes "UNREACHABLE" when its parent is down.

> Your understanding of exactly what is impacted on that host is  
> now completely wrong. By artificially changing the service state, your  
> reporting is no longer reliable as well. You may be fine with that but  
> understand that your goal is opposite of what nagios is meant to do.
>   

In my configuration, WAN failures occur far more often than a host 
crash that takes many services down. I agree with you: when the WAN is 
down, my understanding of exactly what is impacted on the host is 
completely wrong. Nagios says all the services are down, when in my 
opinion it should say that it could not determine the status of the 
services.

Moreover, plugins from various sources behave differently when the host 
is unreachable. Some plugins return UNKNOWN, which is probably the most 
accurate result in such a situation. But some return CRITICAL, and 
others return WARNING. This adds even more confusion to the console, 
where it is then hard to find the original problem.
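
One way to picture a uniform behaviour is a thin wrapper placed in front 
of each plugin that overrides the result whenever the host is known to 
be unreachable. The following is only a sketch under assumptions: the 
function name normalize_result and the host_up flag are hypothetical, 
and how reachability is actually determined (a ping, for instance) is 
left out. Only the numeric exit codes 0 to 3 are the standard plugin 
convention.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def normalize_result(host_up: bool, exit_code: int, output: str) -> Tuple[int, str]:
    """Return the (exit_code, output) pair a wrapper plugin would report.

    host_up is assumed to come from a separate reachability test
    (hypothetical, not shown here).
    """
    if not host_up:
        # The link is down, so whatever the plugin reported cannot be trusted.
        return UNKNOWN, "UNKNOWN - host unreachable, service state not determined"
    if exit_code not in (OK, WARNING, CRITICAL, UNKNOWN):
        # Out-of-range exit codes are reported as UNKNOWN as well.
        return UNKNOWN, f"UNKNOWN - unexpected plugin exit code {exit_code}"
    # Host reachable and exit code valid: pass the plugin result through.
    return exit_code, output
```

With such a wrapper, a plugin that answers CRITICAL or WARNING over a 
dead link would show up as UNKNOWN, while results obtained over a 
working link are passed through unchanged.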


> Instead of disabling the service checks, you  
> may be able to use adaptive monitoring to change the service  
> check_commands to something that always returns UNKNOWN (i.e.  
> check_dummy). 

I have already thought about that. But I would have to change the 
check_command of every service and, more complicated, restore all the 
original check commands when the host comes back. For disabling 
services, there is an external command that disables all service checks 
for a particular host (DISABLE_HOST_SVC_CHECKS), so I can disable them 
all in one go. But to change the check_commands, I would have to act on 
every service individually, which would be a huge amount of work and 
quite difficult to maintain! Each remote server has approximately 20 
service checks, for some hundreds of services in total, and this is only 
the beginning: the full setup will require some thousands of checks, all 
of them across poor WAN links...
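
For reference, disabling and re-enabling all service checks of one host 
through the external command file could look roughly like the sketch 
below. The external commands DISABLE_HOST_SVC_CHECKS and 
ENABLE_HOST_SVC_CHECKS and the "[timestamp] COMMAND;args" line format 
are standard Nagios; the helper names, the host name remote-srv1 and the 
command file path are assumptions for illustration (the path depends on 
the command_file setting in nagios.cfg).

```python
import time
from typing import Optional


def format_external_command(command: str, *args: str, now: Optional[int] = None) -> str:
    """Build one external command line: "[timestamp] COMMAND;arg1;arg2"."""
    timestamp = int(time.time()) if now is None else now
    return f"[{timestamp}] " + ";".join((command,) + args) + "\n"


def submit(command_file: str, line: str) -> None:
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)


if __name__ == "__main__":
    # Assumed path; check command_file in nagios.cfg for the real one.
    cmd_file = "/usr/local/nagios/var/rw/nagios.cmd"
    # Hypothetical host name. Use ENABLE_HOST_SVC_CHECKS once the link is back.
    submit(cmd_file, format_external_command("DISABLE_HOST_SVC_CHECKS", "remote-srv1"))
```

Something still has to decide when to send these two commands. A host 
event handler is one possibility, but that part is not covered by this 
sketch.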

In fact, the parent/child mechanism seems to be the right way to handle 
hosts located behind WAN links or routers. In my opinion, it should be 
possible to treat services as children of their host. This may be a 
feature request for future versions...

Following this idea, I will investigate the following setup:
- Hosts are linked by parent/child relationships that follow the WAN 
topology (already working)
- For each host, I will create a "parent" service with only a 
check_alive command
- Every other service will be a child of this parent service
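
Nagios has no parent directive for services, so the closest existing 
object for the last two points is probably the servicedependency 
definition. Since writing one per service by hand would not scale to 
thousands of checks, the definitions could be generated. The script 
below is a sketch under assumptions: the host name, the "Alive" master 
service and the dependent service names are hypothetical, and the 
failure criteria (w,u,c) are just one plausible choice.

```python
from typing import Iterable

# One servicedependency block: the dependent service is neither checked nor
# notified about while the master service is WARNING, UNKNOWN or CRITICAL.
TEMPLATE = """define servicedependency {{
    host_name                       {host}
    service_description             {master}
    dependent_host_name             {host}
    dependent_service_description   {dependent}
    execution_failure_criteria      w,u,c
    notification_failure_criteria   w,u,c
}}
"""


def render_dependencies(host: str, master: str, dependents: Iterable[str]) -> str:
    """Render one servicedependency block per dependent service of a host."""
    return "\n".join(
        TEMPLATE.format(host=host, master=master, dependent=name)
        for name in dependents
    )


if __name__ == "__main__":
    # Hypothetical host and service names.
    print(render_dependencies("remote-srv1", "Alive", ["Disk", "CPU Load"]))
```

Note that this only stops the dependent checks from running and 
notifying. Whether the console then shows the state I would like to see 
for those services is something the test will have to tell.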

I'll try this right now. Comments and suggestions are welcome. Am I the 
only one having this problem?

Kind regards
-- 

*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* t.ottavi at medi.fr
