Questions about scheduling

Andreas Ericsson ae at op5.se
Tue Dec 19 13:18:05 CET 2006


Hugo van der Kooij wrote:
> On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> 
>> Yes, for reasons stated above. It gets slightly worse if you have a
>> largely linear network (many hosts only have one child), since it also
>> has to check parent hosts until it finds the "closest" possible "up" to
>> determine where a possible network outage is happening.
> 
> Just curious. How will this work if you have something like 5 hosts in 
> line in a parent-child relation?
> 
> The fastest way would be to start from nagios and work your way to the 
> downed host, as the average latency of a check on a live host is much 
> lower than the timeout you get on downed hosts.
> 
> Consider the map as shown on 
> http://hvdkooij.xs4all.nl/statusmap-20061219.png
> 
> If nagios detects the ipv6 router in the lab to be down and it has to work 
> its way up it has to deal with the timeouts on nlams04 and nlams05.
> 
> If it starts polling the other way around it only has to deal with the 
> host check latency of the switch and the timeout of nlams05.
> 

In the case you posted on your map, it would indeed be faster to start 
walking in -> out. However, if the closest parent had been up it would 
have been the other way around. Anyway, I *think* nagios checks 
in -> out. Either way, it's important for a host check to return OK 
*immediately* when it finds that the host it's checking actually *is* 
ok, which is why I wrote check_icmp and gave it a check_host mode 
which does just that. The original default hostcheck (I think it's still 
the default, btw) would wait a minimum of 5 seconds even if the 
first ping came back ok after 5ms. Since all other checks are stopped 
in the meantime, this causes quite a bit of slowdown.
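
To put rough numbers on the two walking directions, here is a small
sketch. The latency and timeout values are assumptions picked for
illustration, and the parent order (switch, nlams05, nlams04) is inferred
from the description quoted above rather than read off the map. The
function names are made up for this example and are not part of nagios.

```python
LATENCY = 0.005  # assumed: seconds for a check against a live host
TIMEOUT = 5.0    # assumed: seconds before a check against a dead host gives up


def cost_in_to_out(parents, is_up):
    """Check from the host nearest nagios outwards; stop at the first dead host."""
    total = 0.0
    for host in parents:
        if is_up[host]:
            total += LATENCY
        else:
            total += TIMEOUT
            break
    return total


def cost_out_to_in(parents, is_up):
    """Check from the problem host's direct parent inwards; stop at the first live host."""
    total = 0.0
    for host in reversed(parents):
        if is_up[host]:
            total += LATENCY
            break
        total += TIMEOUT
    return total


# Parents of the problem host, ordered from nearest nagios to furthest away.
parents = ["switch", "nlams05", "nlams04"]
outage = {"switch": True, "nlams05": False, "nlams04": False}
all_up = {"switch": True, "nlams05": True, "nlams04": True}

for label, state in (("outage at nlams05", outage), ("all parents up", all_up)):
    print(label, cost_in_to_out(parents, state), cost_out_to_in(parents, state))
```

With the outage at nlams05, in -> out pays for one timeout and out -> in
pays for two. With every parent up, out -> in stops after a single fast
check, which is the "other way around" case.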

When I think about it, it would indeed (especially with check_host) be 
faster to start unreachability checks with the root hosts and then follow 
the children down to the targeted host, since that way we would encounter 
a minimum of host check timeouts. It's programmatically slightly trickier 
though, as you'd have to walk backwards from the problem host, push 
each parent host onto a stack and then pop them from that stack when you 
do the actual checking. I'll look into this and see if a patch is 
necessary and, if so, if I can come up with one.
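
The push-then-pop walk described above could look roughly like this. It
is only a sketch, not nagios code: it assumes every host has at most one
parent (nagios allows several), and parent_of, check and the host name
"ipv6-router" are hypothetical stand-ins for illustration.

```python
def find_outage_root_first(problem_host, parent_of, check):
    """Return the first parent that fails its check, walking from the root
    host towards problem_host, or None if every parent answers."""
    # Walk backwards from the problem host, pushing each parent on a stack.
    stack = []
    host = parent_of.get(problem_host)
    while host is not None:
        stack.append(host)
        host = parent_of.get(host)

    # Popping reverses the order, so the root host is checked first.
    while stack:
        host = stack.pop()
        if not check(host):
            return host
    return None


# Hypothetical single-parent chain: switch -> nlams05 -> nlams04 -> ipv6-router
parent_of = {"ipv6-router": "nlams04", "nlams04": "nlams05", "nlams05": "switch"}
alive = {"switch"}
checked = []


def check(host):
    """Stand-in for a real host check; records the order hosts are checked in."""
    checked.append(host)
    return host in alive


culprit = find_outage_root_first("ipv6-router", parent_of, check)
print(culprit, checked)
```

Only the switch and nlams05 get checked here; nlams04 is never touched
because the walk stops at the first parent that fails.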

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
