Service check on DOWN hosts!!!

Toussaint OTTAVI t.ottavi at medi.fr
Thu Feb 26 17:23:05 CET 2009


Marc Powell wrote:
> On Feb 23, 2009, at 9:43 AM, Sergio Ariel wrote:
>
>   
>> My problem is that when these host are DOWN, Nagios wait 30 seconds
>> trying to execute the service check. After this 30 seconds, then
>> tell me "CRITICAL SERVICE". I want to avoid Nagios checks any service
>> in DOWN
> Nagios isn't designed to do this; you'll need to jump through hoops to  
> accomplish it. At the least you need to be running nagios-3 with  
> active host checks configured. I'd suggest you look at creating an  
> event handler for your hosts that issues the external command  
> 'DISABLE_HOST_SVC_CHECKS' when the host is non-OK and issues the  
> external command 'ENABLE_HOST_SVC_CHECKS' when the host recovers.
>   

Hi,

I already posted about the same problem some months ago.

I tried Marc's workaround using event handlers and external commands. I 
also tried another smart workaround using service dependencies: you 
create an explicit 'check_host_alive' service, then a service 
dependency, so that all your other services are not checked if the main 
service fails. Wildcards can be helpful here. Search for my name, 'OTTAVI', 
in the list history and you'll find more details about these two 
workarounds.
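
For readers who want to try the event handler approach, here is a minimal sketch in Python of what such a handler might look like. It assumes the handler is called with the $HOSTNAME$ and $HOSTSTATE$ macros as its two arguments, and that the external command file lives at /usr/local/nagios/var/rw/nagios.cmd (check the command_file setting in your nagios.cfg). The function names build_command and submit are made up for this illustration.

```python
import sys
import time

# Assumption: path of the external command file; see command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host_name, host_state, now=None):
    """Return the external command line matching the host state.

    host_state is expected to be the value of $HOSTSTATE$
    (UP, DOWN or UNREACHABLE). Anything other than UP disables
    the service checks of the host.
    """
    timestamp = int(time.time() if now is None else now)
    if host_state == "UP":
        command = "ENABLE_HOST_SVC_CHECKS"
    else:
        command = "DISABLE_HOST_SVC_CHECKS"
    return f"[{timestamp}] {command};{host_name}\n"


def submit(line, command_file=COMMAND_FILE):
    """Write one external command line to the Nagios command file."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Hypothetical invocation from a command definition:
#   handler.py $HOSTNAME$ $HOSTSTATE$
if __name__ == "__main__" and len(sys.argv) == 3:
    submit(build_command(sys.argv[1], sys.argv[2]))
```

This only automates the two external commands Marc mentioned; it does nothing about service checks that run before the handler fires.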

However, neither of these workarounds completely solves the problem. 
If some service checks are scheduled BEFORE the event handler triggers, 
or before the dependency takes effect, then these service checks will 
return a 'FAILED' status. You can mitigate this by reducing the check 
interval of the 'parent' check_alive service, but you will still get 
a 'FAILED' status for any checks that run before the host is seen as down.

I have had this problem for months, but I have not found a suitable solution.

Nagios has a 'parent/child' relationship system, which could be helpful 
in such a situation, but it works only for hosts. There is no 
parent/child relationship between services, or between hosts and 
services, which could solve our problem completely! Let's hope the 
developers will take our problem into consideration for future versions.

A better workaround might be to use an event handler not 
only to disable all service checks, but also to put all of them in an 
'UNKNOWN' state (this would simulate the parent/child/unreachable 
logic). Then, even if the event handler triggers AFTER some (failed) 
service checks, the 'FAILED' status would be replaced by a more accurate 
'UNKNOWN' status. Unfortunately, I didn't manage to do that in a script 
run as an event handler, and I don't know whether it is possible; my 
general scripting knowledge is quite poor. Maybe a Perl or Bash 
guru could help us write such a script?
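
One possible way to approach the 'UNKNOWN' idea, sketched here in Python rather than Perl or Bash, is to submit a passive result for each service through the PROCESS_SERVICE_CHECK_RESULT external command with return code 3 (UNKNOWN). This is an untested sketch under several assumptions: the services must accept passive check results, the handler must be given the list of service descriptions for the host (Nagios does not hand it over automatically), and the command file path shown is only a common default. The names unknown_results and submit are made up for this illustration.

```python
import time

UNKNOWN = 3  # standard plugin return code for the UNKNOWN state


def unknown_results(host_name, service_descriptions, now=None):
    """Build one passive UNKNOWN result line per service of the host."""
    timestamp = int(time.time() if now is None else now)
    output = f"Host {host_name} is not UP, service state cannot be determined"
    return [
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host_name};{service};{UNKNOWN};{output}\n"
        for service in service_descriptions
    ]


def submit(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the result lines to the external command file.

    Assumption: command_file matches the command_file setting in nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

A handler built on this would call submit(unknown_results(host, services)) after disabling the service checks, so that any result produced before the handler ran is overwritten.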

Kind regards,
-- 

*Toussaint OTTAVI*
*MEDI INFORMATIQUE*