Nagios-devel digest, Vol 1 #807 - 8 msgs

Andreas Ericsson ae at op5.se
Tue May 10 10:29:16 CEST 2005


sean finney wrote:
> On Tue, May 10, 2005 at 12:40:22AM +0200, Andreas Ericsson wrote:
> 
>>Actually, I made a mistake there. Only two fork()'s, as /bin/sh calls 
>>execve() more or less immediately.
> 
> 
> are you sure about that?

Yes. At least this is how bash does it on Linux, according to strace.

>  this is going off on a bit of a tangent,
> but i believe "/bin/sh -c command" forks before it executes command
> (as command can be "subcmd1; subcmd2").

The shell parses and splits the arguments before calling execve(), and 
forks only when necessary.
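
To make the process chains being compared concrete, here is a minimal sketch of a single fork() followed by an exec of whatever argv[0] names. The helper name run_argv is hypothetical and this is not code from nagios; it only illustrates that the same helper can either go through "/bin/sh -c" or exec a plugin binary directly, skipping the shell.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork once, exec argv[0] directly in the child,
 * and return the child's exit status (-1 on error or abnormal exit). */
int run_argv(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0) {
        execv(argv[0], argv);   /* only returns if the exec failed */
        _exit(127);             /* same convention the shell uses */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing {"/bin/sh", "-c", "some command", NULL} puts the shell in the chain; passing the plugin path and its arguments directly does not.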

> 
>>>and i'm suggesting
>>>- nagios forks
>>
>>nagios child does a symbol lookup of plugin_function
> 
> 
> yes, but in the "grand scheme" of things, the lookup could be a one
> time cost, leaving a function pointer for calling the plugin.
> 

No, because nagios needs to know on every invocation whether this plugin 
is dlopen()'ed or an executed binary. Adding weight to the already 
too-large-for-posix service struct wouldn't be a good idea (although it 
should be reworked to have many of its boolean ints tagged as flags 
instead).
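
A rough sketch of what "boolean ints tagged as flags" could look like, assuming a single 32-bit field. The struct, the flag names and the helpers are all made up for illustration and do not match the real nagios service struct.

```c
#include <stdint.h>

/* Hypothetical flag bits; each one replaces a separate boolean int. */
enum svc_flag {
    SVC_ACTIVE_CHECKS = 1u << 0,
    SVC_NOTIFICATIONS = 1u << 1,
    SVC_PLUGIN_DLOPEN = 1u << 2   /* plugin is dlopen()'ed, not exec'ed */
};

/* Hypothetical stand-in for the service struct: one word holds all flags. */
struct svc_sketch {
    uint32_t flags;
};

void svc_set(struct svc_sketch *s, uint32_t f)   { s->flags |= f; }
void svc_clear(struct svc_sketch *s, uint32_t f) { s->flags &= ~f; }
int  svc_has(const struct svc_sketch *s, uint32_t f)
{
    return (s->flags & f) == f;   /* true only if every bit in f is set */
}
```

With this layout, up to 32 booleans cost four bytes in total instead of four bytes each.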

> 
>>True. Didn't think of that. But with that in mind it might be better to do
>>nagios -> fork("pure" child) -> execve()
>>which would also save us the (currently not) superfluous fork()'s. This 
>>is a trick question. Or is it? ;)
> 
> 
> aha, got you!  
> 

Then you know that this method is what happens (more or less) in the 
multiplexing method, which turned out to be simple but doesn't scale 
beyond max_concurrent_checks=(OPEN_MAX-5) or so without some serious voodoo.

> 
>>>i think you and i are barking up two slightly different trees here.
>>
>>We're both discussing performance improvements in nagios, right?
> 
> 
> yes, but i was talking chiefly wrt avoiding all the fork/exec overhead,
> and you're talking a "slightly" grander scheme improvement.  note that
> these are not mutually exclusive.

In some cases they are. Pure multithreading would do well not to meddle 
with in-house plugins, since that leaves no means of protection for the 
core in case the plugin misbehaves (which it's bound to do sooner or later).

>  in fact, i'd be willing to bet
> that at some point the parallel nature of a multithreaded nagios
> daemon will hit right back up against a serialized bottleneck, that
> being the number of forked processes vs the number of processors
> available.

The number of CPUs doesn't matter. It's the number of active processes 
that's important. Most of nagios' time is lost idling between two chunks 
of checks because it doesn't parallelize well enough (one chunk needs 
to finish completely before the next can begin). A great improvement 
would be to add checks to the queue in serial but let them run as 
slots become available.
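
As a sketch of that idea (not how nagios schedules checks today), the loop below keeps at most max_parallel children running and starts the next queued command as soon as any child is reaped, instead of waiting for a whole chunk to finish. The names start_check and run_rolling are hypothetical, and each command is run through "/bin/sh -c" purely to keep the example short.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: fork a child that runs cmd via the shell; returns its pid. */
static pid_t start_check(const char *cmd)
{
    pid_t pid = fork();
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    return pid;                          /* -1 if fork failed */
}

/* Run n commands with at most max_parallel in flight at once.
 * Returns how many of them exited with status 0. */
int run_rolling(const char *const cmds[], int n, int max_parallel)
{
    int next = 0, running = 0, ok = 0;

    while (next < n || running > 0) {
        /* Fill every free slot from the queue, in order. */
        while (next < n && running < max_parallel) {
            if (start_check(cmds[next]) > 0)
                running++;
            next++;
        }
        if (running == 0)
            continue;                    /* nothing to reap (forks failed) */

        int status;
        if (wait(&status) < 0)           /* blocks until ANY child exits */
            break;
        running--;                       /* one slot is free again */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

The slowest command only occupies one slot here; it no longer holds up everything queued behind it.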

>  in such a case this small improvement might prove to be
> a little helpful after all.
> 

Nopes, it won't. The forked children just hang around and wait for the 
popen call to finish (pclose and fgets both block), which means they 
don't claim a CPU and thus aren't eligible for activation in the kernel 
scheduling loop.
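
For illustration, this is roughly the shape of what such a forked child does, assuming one line of plugin output is wanted. The function name first_line is hypothetical. Both fgets() and pclose() put the caller to sleep until the command produces output or exits, so the child uses no CPU while it waits.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for popen()/pclose() declarations */
#endif
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: run cmd, copy its first output line into buf
 * (len must be > 0), and return its exit status (-1 on error). */
int first_line(const char *cmd, char *buf, int len)
{
    FILE *fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    /* Blocks until a full line, EOF, or a full buffer. */
    if (fgets(buf, len, fp) == NULL)
        buf[0] = '\0';

    /* Blocks until the command has exited, then yields its wait status. */
    int status = pclose(fp);
    if (status == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```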

> 
>>mechanism (either fork() or pthread_create()) which makes the entire 
>>notion somewhat ridiculous for plugins that are used less frequently 
>>than say once every five seconds. The only plugin I can think of that 
>>falls into that category would be a ping check and possibly interface 
>>related snmp-checks for switches and routers in really huge networks.
> 
> 
> the idea isn't so much to make any individual check any faster, but
> to reduce the overall load on the server.  
> 
> to argue against myself it's worth noting that of the two plugins you
> mention this being the most helpful, one already calls popen inside of
> itself, and the second is currently a perl plugin :/
> 

check_icmp doesn't. I wrote it to increase the abysmal performance of 
hostchecks and it works miracles when faced with broken routers. I could 
easily implement something similar in nagios, but it would be more or 
less lost on service checks because nagios doesn't schedule chunk2 until 
chunk1 is finished, so it's the slowest finishing plugin that decides 
when chunk2 can get started (Ethan, correct me if I'm wrong).

> 
>>I'd be happy to revise my opinion on this if you were to provide a PoC 
>>that shows that dlopen() works significantly (5% or more) faster than 
>>the popen() approach while running checks in parallel and reaping the 
>>results in a satisfactory manner (1 line of output, no leaks, get 
>>return-value). It would also be nice if the current level of protection 
>>remains (plugins can sigsegv without affecting the main process). So far 
>>I've seen nothing that even comes close to this.
> 
> 
> i'll throw something together tonight, even if it is just as an academic
> exercise.  we'll see, i guess :)
> 

Ah, the lure of intellectual masturbation. ;)
I'm looking forward to seeing the code and the benchmarks.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

