max_concurrent_checks=0 not working on 3.2.1 (maybe earlier versions, too)?

Andreas Ericsson ae at op5.se
Mon Apr 19 14:15:01 CEST 2010

Previous message: max_concurrent_checks=0 not working on 3.2.1 (maybe earlier versions, too)?
Next message: Case-insensitive objects in NDOutils
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 04/19/2010 01:53 PM, Max wrote:
> On Mon, Apr 19, 2010 at 5:12 AM, Andreas Ericsson <ae at op5.se> wrote:
>>> This would indicate that max_concurrent_checks=0 is limiting it to
>>> some number rather than using all the CPU possible, which sounds like
>>> a bug.
>>>
>>
>> Yes. I expect this has to do with the smart check interleave factor and
>> wildly different check_interval variables. Nagios does something wrong
>> there, but it's unusual enough (and mild enough) for most people that
>> no one has bothered to correct it yet.
> 
> Will be very interested in seeing your results and hope you are able
> to do great things with the process pooling and your investigation
> into this.
> 
> As part of our mostly *black box* tuning for better performance, we
> turn off the 'smart' intervals and we set check interleave factor to
> always equal the number of hosts we are polling.  We also removed the
> section of code where Nagios sleeps on non-runnable events and made
> the sleep time between checks as small as we could with nanosleep
> enabled.  We leave max concurrent checks at 0 as well.  With all that
> in place and on a setup with a large % of ePN-based checks (~90% < 1
> sec execution time according to nagiostats) and a NEB module
> processing performance data we get a max of about 50 checks/second on
> a dual quad core CPU host.  With DNX we saw that rise to about 85
> checks/second but the version of DNX we tried was not honoring the
> Nagios check schedule well (that was before all the patching work that
> has been going on in the DNX project lately, so it might be better now).
> 

Well, initial tests show that it's not a problem to fire off 500 checks
at once with the multiplexing proof-of-concept code I have sitting on my
hard drive. The idea is that the scheduler will just throw out checks as
and when they're needed and the workers will just spawn and reap them
as fast as possible. With the workers doing nothing but that, they can
saturate the CPU fairly well, and if more workers are needed it's easy
enough to spawn one (or more).
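
As a rough illustration of the spawn-and-reap idea (this is not the
proof-of-concept code mentioned above, only a minimal Python sketch of
one worker; the names run_checks and fake_check are hypothetical, and
the real code would presumably be C inside Nagios):

```python
import subprocess
import sys
import time


def run_checks(commands, poll_interval=0.01):
    """Spawn every command at once, then reap children as they exit.

    Returns a dict mapping each command's index to its exit code.
    """
    # Spawn phase: start all checks without waiting for any of them.
    running = {
        index: subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        for index, cmd in enumerate(commands)
    }
    results = {}
    # Reap phase: collect exit codes as soon as each child finishes.
    while running:
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                results[index] = code
                del running[index]
        if running:
            time.sleep(poll_interval)
    return results


def fake_check(exit_code):
    """Hypothetical stand-in for a plugin: exits with the given status."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
```

Calling run_checks([fake_check(0), fake_check(2)]) starts both children
before waiting on either, which is the property that lets a worker keep
many checks in flight at once.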

The only thing I'm worried about is that it'll now be possible to hog
100% CPU on a system running Nagios. Some smarts will have to be added
to take care of that scenario.
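
One form such smarts might take, purely as an assumption and not
something decided here, is to hold back new checks while the 1-minute
load average per core is above a cap. A minimal Python sketch (the
function names and the 0.9 default are hypothetical; os.getloadavg is
Unix-only):

```python
import os


def may_spawn(load_1min, cpu_count, max_load_per_core=0.9):
    """Return True if the current load is still under the configured cap."""
    return load_1min < cpu_count * max_load_per_core


def system_may_spawn(max_load_per_core=0.9):
    """Apply may_spawn() to the live system (Unix only: os.getloadavg)."""
    load_1min, _, _ = os.getloadavg()
    return may_spawn(load_1min, os.cpu_count() or 1, max_load_per_core)
```

A worker would consult system_may_spawn() before starting each check and
back off briefly when it returns False.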

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.


More information about the Developers mailing list