High latencies problem.

Alessandro Ren alessandro.ren at opservices.com.br
Tue Feb 17 20:06:10 CET 2009



On 2/17/2009 3:15 PM, D. Emmanuel Feinsmith wrote:
> Dear Alessandro,
>
> You are more than likely eating up the CPU and memory with the 
> memcpy's executed by each fork of your check_nrpe and check_icmp 
> services. You can prove this to yourself by using top to observe 
> the behaviour of the nagios processes. I would also suggest making 
> sure that nothing else is eating up CPU and memory on your nagios 
> server box, and keeping the box dedicated. Running top will show 
> whether there is resource contention on your monitoring server. Keep 
> in mind that check_nrpe is among the slowest commands nagios can 
> execute, because it has to wait, for up to the timeout period you 
> entered in your client nrpe.cfg, for the nrpe daemon to respond. This 
> can take seconds in some cases. A much more scalable solution is to 
> enable passive checks (using nsca/send_nsca) on some or all of your 
> clients.
>
> I would suggest the following things (from the nagios performance 
> tuning guide):
>
> # *Check service latencies* to determine best value for maximum 
> concurrent checks. Nagios can restrict the number of maximum 
> concurrently executing service checks to the value you specify with 
> the max_concurrent_checks option. This is good because it gives you 
> some control over how much load Nagios will impose on your monitoring 
> host, but it can also slow things down. If you are seeing high latency 
> values (> 10 or 15 seconds) for the majority of your service checks 
> (via the extinfo CGI), you are probably starving Nagios of the checks 
> it needs. That's not Nagios's fault - it's yours. Under ideal 
> conditions, all service checks would have a latency of 0, meaning they 
> were executed at the exact time that they were scheduled to be 
> executed. However, it is normal for some checks to have small latency 
> values. I would recommend taking the minimum number of maximum 
> concurrent checks reported when running Nagios with the -s command 
> line argument and doubling it. Keep increasing it until the average 
> check latency for your services is fairly low.
>
> # *Optimize host check commands*. If you're checking host states using 
> the check_ping plugin you'll find that host checks will be performed 
> much faster if you break up the checks. Instead of specifying a 
> max_check_attempts value of 1 in the host definition and having the 
> check_ping plugin send 10 ICMP packets to the host, it would be much 
> faster to set the max_check_attempts value to 10 and only send out 1 
> ICMP packet each time. This is due to the fact that Nagios can often 
> determine the status of a host after executing the plugin once, so you 
> want to make the first check as fast as possible. This method does 
> have its pitfalls in some situations (e.g. hosts that are slow to 
> respond may be assumed to be down), but you'll see faster host checks 
> if you use it. Another option would be to use a faster plugin (e.g. 
> check_fping) as the host_check_command instead of check_ping.
>
> # *Schedule regular host checks.* Scheduling regular checks of hosts 
> can actually help performance in Nagios. This is due to the way the 
> cached check logic works (see below). Prior to Nagios 3, regularly 
> scheduled host checks used to result in a big performance hit. This is 
> no longer the case, as host checks are run in parallel - just like 
> service checks. To schedule regular checks of a host, set the 
> check_interval directive in the host definition to something greater 
> than 0.
>
> # *Enable cached host checks*. Beginning in Nagios 3, on-demand host 
> checks can benefit from caching. On-demand host checks are performed 
> whenever Nagios detects a service state change. These on-demand checks 
> are executed because Nagios wants to know if the host associated with 
> the service changed state. By enabling cached host checks, you can 
> optimize performance. In some cases, Nagios may be able to use the 
> old/cached state of the host, rather than actually executing a host 
> check command. This can speed things up and reduce load on the 
> monitoring server. In order for cached checks to be effective, you need to 
> schedule regular checks of your hosts (see above). More information on 
> cached checks can be found here.
>
> For more, see:
>
> http://nagios.sourceforge.net/docs/3_0/tuning.html

     Daniel,

     I've read this document more than once in my search to bring the 
latency down.
     Passive checks are not a possibility right now; maybe with another 
nagios instance they would be OK.
     I am trying to avoid having to use another nagios instance for now, 
but I also have this option in mind.
     I've already used max_concurrent_checks=0 and I've not noticed any 
change in latency times.
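
     For what it's worth, the numbers in this thread can be put into a 
rough capacity estimate. The sketch below is only an illustration: it 
uses the 11160 services and 5-minute check interval from the original 
post and the roughly 30 check processes seen running on average, and 
applies Little's law (checks in flight = check rate x mean check 
duration). The mean check durations tried in the loop are hypothetical 
values, not measurements, and the function names are made up for this 
example.

```python
# Rough capacity estimate from the figures quoted in this thread.
# Mean check duration is NOT given in the thread; the values tried
# below are hypothetical.

SERVICES = 11160            # services on the large system (original post)
CHECK_INTERVAL_S = 5 * 60   # all services are checked every 5 minutes
OBSERVED_PROCESSES = 30     # average check processes seen running


def required_rate(services, interval_s):
    """Checks per second needed to run every service once per interval."""
    return services / interval_s


def sustainable_duration(concurrency, rate):
    """Little's law solved for W: the longest mean check duration (s)
    that `concurrency` in-flight checks can sustain at `rate` checks/s."""
    return concurrency / rate


def required_concurrency(rate, mean_duration_s):
    """Little's law: average number of checks in flight (L = rate * W)."""
    return rate * mean_duration_s


if __name__ == "__main__":
    rate = required_rate(SERVICES, CHECK_INTERVAL_S)
    print(f"required rate: {rate:.1f} checks/s")
    limit = sustainable_duration(OBSERVED_PROCESSES, rate)
    print(f"{OBSERVED_PROCESSES} processes keep up only if a check "
          f"averages <= {limit:.2f} s")
    for duration in (0.5, 1.0, 2.0, 5.0):   # hypothetical durations
        need = required_concurrency(rate, duration)
        print(f"mean duration {duration:>4} s -> ~{need:.0f} checks in flight")
```

     With these figures about 37 checks per second are needed, so 30 
concurrent processes only keep up if a check takes roughly 0.8 s on 
average. Anything slower needs more checks in flight, which would be 
consistent with the latency building up when too few processes are 
started.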

     Tks.

>
> If none of this works, you may have to use passive checks or multiple 
> nagios instances to drop your latency.
>
> Bonne chance!
> Daniel.
>
> On Feb 17, 2009, at 8:41 AM, Alessandro Ren wrote:
>
>> On 2/17/2009 1:32 PM, D. Emmanuel Feinsmith wrote:
>>
>>     Answers below,
>>> Alessandro,
>>>
>>> 1.  what is the breakdown between passive and active checks? For
>>> passive checks, there are many ways to increase the # of services
>>> through bypassing the command pipe (which nsca writes to). With
>>> passive checks done in this way I've gone to 50,000 services with
>>> under 10 second latency.
>>>
>>     All active checks, no passive.
>>
>>> 2.  how many of those services are check_icmp or check_ping? If there
>>> is a good number of those, you can use fping to reduce the # of fork/
>>> exec's that nagios has to perform, which is a major area of resource
>>> utilization within the nagios server.
>>>
>>     Less than 5% are ping checks and we use check_icmp for all those.
>> Most checks are check_nrpe.
>>
>>> 3. Are you using a performance data handler or OCSP? If so, you might
>>> either find a way to get rid of these entirely, or be sure you are
>>> using file based performance handling at the very minimum.
>>>
>>     I am using perfparse to write to mysql. Disabling it has no effect
>> on the latency.
>>
>>> The key to nagios scalability and latency reduction is to reduce the #
>>> of fork/exec's to the smallest amount possible and keep away from the
>>> command pipe as much as you can if you are passive-check heavy. If you
>>> are using all active checks, then you can balance the load between
>>> active and passive checks and thereby gain some speed.
>>>
>>
>>     In my other nagios with just 2600 services, we see around 200
>> nagios processes running on average; in the 11600 services system, the
>> average is 30 processes. It seems that the event loop is lagging: it is
>> not starting enough processes, thus raising the latency.
>>
>>     Thank you Daniel.
>>> Daniel.
>>>
>>> On Feb 17, 2009, at 8:17 AM, Alessandro Ren wrote:
>>>
>>>
>>>>    Hello,
>>>>
>>>> I have a nagios system running with 427 hosts and 11160 services, and
>>>> since I reached 8000 services I have been having problems with the
>>>> latency being around 100s to 200s.
>>>>    use_large_installation_tweaks is enabled, max_concurrent_checks has
>>>> been tested with 0 and higher values, and I have tested this setup on
>>>> two different HWs: a dual core with 4GB RAM, 32 bits, and a Dual Xeon
>>>> Dual core, 64 bits, with 8GB of RAM. We are using RedHat Enterprise 5.
>>>>    Also, reaper is already at 2s, host checks with cache horizon are
>>>> enabled with a max retry of 3, and all services are checked every 5min.
>>>>    I have no service dependency set up.
>>>>    I've noticed that nagios is not spawning as many processes as
>>>> another nagios I have running, which has far fewer services, and it
>>>> seems in my debugs that the event loop is lagging behind.
>>>>    Any ideas what I could do to fix that? Have I reached a limit in the
>>>> nagios poller code?
>>>>
>>>>    Tks.
>>>>
>>>> -- 
>>>> Alessandro Ren
>>>> http://www.opservices.com.br
>>>> alessandro.ren at opservices.com.br 
>>>> <mailto:alessandro.ren at opservices.com.br>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Open Source Business Conference (OSBC), March 24-25, 2009, San
>>>> Francisco, CA
>>>> -OSBC tackles the biggest issue in open source: Open Sourcing the
>>>> Enterprise
>>>> -Strategies to boost innovation and cut costs with open source
>>>> participation
>>>> -Receive a $600 discount off the registration fee with the source
>>>> code: SFAD
>>>> http://p.sf.net/sfu/XcvMzF8H
>>>> _______________________________________________
>>>> Nagios-devel mailing list
>>>> Nagios-devel at lists.sourceforge.net 
>>>> <mailto:Nagios-devel at lists.sourceforge.net>
>>>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>>>
>>>
>>>
>>
>
