Severe performance issue during major network outage

Aidan Anderson mail at aidananderson.co.uk
Sun May 13 21:24:47 CEST 2007


Aidan Anderson wrote:
> Ton Voon wrote:
>
>> On 11 May 2007, at 20:25, Aidan Anderson wrote:
>>
>>> First of all, thank you for the replies!
>>>
>>> The majority of devices that I monitor are routers/VPN devices and I
>>> have (on the documentation's advice) not set active checks on the
>>> hosts.  Instead, I've added check_ping as a service on each of these
>>> hosts to do 5 pings as follows:
>>>
>>> check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
>>>
>>> For the host check I already use, as you suggested, a check_ping that
>>> only does one ping as follows:
>>>
>>> check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
>>>
>>> My understanding was that if the service check failed, Nagios would
>>> abandon the service check altogether and move on to the host check,
>>> which is only 1 ping.  Because the service checks are parallelised, it
>>> shouldn't matter that they send 5 pings, and a 1-ping host check
>>> should relieve the bottleneck of serialised host checks.  I'm at a
>>> loss as to why performance has been impacted so severely.
>>>
>>> Maybe I need to abandon the service checks altogether and just have a
>>> host check.  I'm reluctant to do this because I get very useful
>>> information from 5 pings, i.e. packet loss and high RTA, which is
>>> particularly handy for checking volatile links such as ADSL.  Maybe
>>> that is the trade-off: fast host checking with no useful stats, or
>>> slow host checking with useful stats.
>>>
>> Just noticed this in your original email:
>>
>> Host Check Execution Time:       0.03   / 10.04   / 0.843 sec
>>
>> This means that some of your host checks are taking 10 seconds, which
>> is, funnily enough, the timeout period for check_ping. So the -p 1
>> check will still take 10 seconds if the routers are not responding.
>>
>> You can use a timeout flag for check_ping (but it is only supported on
>> some OSes). I guess check_icmp is a better bet here.
>>
>> Ton
>>
> Hi Ton,
>
> Well spotted, thank you.  check_icmp here we come :)
>
> Thanks,
> Aidan
>
I've now changed my host and service checks to use check_icmp instead
of check_ping.  It seems to work far more efficiently and has reduced my
average service and host check execution times from 11 seconds to 4-5
seconds.

It didn't, however, make Nagios notice the hosts go down any quicker.  
It still took an hour to notice that 109 hosts had gone down and during 
that hour, latency times shot up above 2000 seconds.  Once it had 
finally noticed that all 109 hosts were down, latency times dropped back 
to normal.  This must be down to the serialisation of host checks so 
I'll wait patiently for the stable release of version 3.
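
To make the serialisation argument concrete, here is a back-of-envelope
sketch in Python (it is not part of Nagios).  It assumes every check
against a dead host runs to the 10-second timeout reported earlier in
the thread and that serial host checks never overlap.  The attempt
count of 3 and the pool of 20 parallel workers are hypothetical values
picked purely for illustration, not settings taken from this thread.

```python
import math


def serial_detection_seconds(hosts_down, seconds_per_check, attempts_per_host):
    """Time to confirm every host DOWN when checks run strictly one at a time."""
    return hosts_down * seconds_per_check * attempts_per_host


def parallel_detection_seconds(hosts_down, seconds_per_check,
                               attempts_per_host, workers):
    """Same workload, but up to `workers` checks run concurrently."""
    batches = math.ceil(hosts_down / workers)
    return batches * seconds_per_check * attempts_per_host


HOSTS_DOWN = 109   # number of hosts lost in the outage described above
TIMEOUT_S = 10     # per-check timeout seen in the host check statistics
ATTEMPTS = 3       # ASSUMPTION: hypothetical retry count per host
WORKERS = 20       # ASSUMPTION: hypothetical number of concurrent checks

serial = serial_detection_seconds(HOSTS_DOWN, TIMEOUT_S, ATTEMPTS)
parallel = parallel_detection_seconds(HOSTS_DOWN, TIMEOUT_S, ATTEMPTS, WORKERS)

print(f"serial:   {serial} s ({serial / 60:.1f} min)")
print(f"parallel: {parallel} s ({parallel / 60:.1f} min)")
```

With these assumed numbers the serial case comes to roughly 55 minutes,
which is the same order as the hour observed, while the parallel case
finishes in about 3 minutes.  The model ignores service checks and
scheduling overhead, so treat it as an illustration only.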

Thanks again for the replies.

regards,
Aidan

