how parallel are host checks? (after service check fails)

Andreas Ericsson ae at op5.se
Sat Aug 13 11:40:52 CEST 2005


Marc Powell wrote:
> 
>>-----Original Message-----
>>From: nagios-users-admin at lists.sourceforge.net [mailto:nagios-users-
>>admin at lists.sourceforge.net] On Behalf Of Juhani Tali
>>Sent: Friday, August 12, 2005 7:43 AM
>>To: nagios-users at lists.sourceforge.net
>>Subject: [Nagios-users] how parallel are host checks? (after service
>>check fails)
>>
>>
>>I have read something about that Nagios will stop processing all
>>service checks on all hosts if some host goes down, until the host
>>check is finished.
>>
>>Is this true? (also in 2.x?)
>>
> 
> 
> Yes. In fact, _all_ other activities stop.
> http://nagios.sourceforge.net/docs/1_0/checkscheduling.html - Host
> Checks section. This behavior is maintained in 2.x. Nagios is primarily
> a service monitor, not a host monitor.
> 
> 
>>The problem is that I have about 400-500 hosts (cisco routers mostly)
>>and I would like if a problem with one host would not delay locating
>>the problem on another host. Some, but not all host parents are defined.
>>If it is true, that a host check freezes the entire nagios, then is
>>there a way to make it more parallel?
> 
> 
> Your host checks should be as simple as possible and finish as quickly
> as possible. Running the checks in parallel defeats some of the other
> functionality such as determining unreachable vs. down hosts.
> 

This isn't necessarily true, but some clever scheduling needs to be done 
to be able to run the chain of checks needed for unreachable 
determination. I've got a couple of ideas on this, but they all involve 
rather heavy modifications to how Nagios schedules and runs its checks. 
Most importantly, it needs to do inter-thread communication by some 
other means than sending results through a FIFO.
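
To make the idea concrete, here is a rough Python sketch of probing a 
whole parent chain at once and then deciding between DOWN and 
UNREACHABLE. This is not how Nagios works internally; classify_host, 
is_up and the router names are made up for illustration, and it assumes 
a single linear chain of parents.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_host(chain, is_up):
    """Classify the first host in `chain` as UP, DOWN or UNREACHABLE.

    chain: host names ordered from the monitored host up through its parents.
    is_up: hypothetical callable taking a host name and returning True/False.
    """
    # Probe every link at once, so the total time is roughly one probe
    # rather than one probe per link in the chain.
    with ThreadPoolExecutor(max_workers=len(chain)) as pool:
        results = list(pool.map(is_up, chain))

    if results[0]:
        return "UP"
    # The host did not answer. If any parent is also silent, the host
    # cannot be reached; otherwise the host itself is down.
    if not all(results[1:]):
        return "UNREACHABLE"
    return "DOWN"


if __name__ == "__main__":
    # Stand-in for a real probe: a fixed table of which hosts answer.
    answers = {"router-c": False, "router-b": False, "router-a": True}
    print(classify_host(["router-c", "router-b", "router-a"], answers.get))
```

In a real scheduler is_up would be the host check plugin, and the 
results would still have to get back to the main process somehow, which 
is where the FIFO limitation comes in.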

A good start is to use the check_icmp plugin in check_host mode, which 
returns OK immediately upon receiving a valid ICMP ECHO response. 
Compared with the default host check, this cuts the check time from 4 
seconds to around 0.1 when the host is actually up, and from 100 
seconds to about 45 when the host is down.

If the chain of checks could be run in parallel, the cost would drop 
from (seconds * max_check_attempts * links_in_chain) to simply (seconds 
* max_check_attempts). In combination with check_icmp in check_host 
mode, a down host with parents in 5 levels would take 45 seconds to 
check, rather than around 10 minutes with the default host check and 
the current check scheduling.
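
The arithmetic can be sketched in Python as below. The function names 
are made up, and the split of the per-host totals into a per-attempt 
time and max_check_attempts = 10 is an assumption chosen only so the 
products match the 100 and 45 second figures above.

```python
def serial_check_time(seconds, max_check_attempts, links_in_chain):
    """Worst case when every link in the parent chain is checked in turn."""
    return seconds * max_check_attempts * links_in_chain


def parallel_check_time(seconds, max_check_attempts):
    """Worst case when all links in the chain are checked at the same time."""
    return seconds * max_check_attempts


if __name__ == "__main__":
    # Assumed split: 10 attempts of 10 s each (default host check),
    # 10 attempts of 4.5 s each (check_icmp in check_host mode).
    print(serial_check_time(10, 10, 5))   # default check, 5 levels, one at a time
    print(parallel_check_time(4.5, 10))   # check_icmp, all levels at once
```

With those assumed numbers the serial case comes to 500 seconds, the 
same order as the 10 minutes mentioned, and the parallel case to 45 
seconds.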

> See 7 and 8 from http://nagios.sourceforge.net/docs/1_0/tuning.html
> 
> --
> Marc
> 
> 
> -------------------------------------------------------
> SF.Net email is Sponsored by the Better Software Conference & EXPO
> September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
> Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
> Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

