Nagios-devel digest, Vol 1 #807 - 8 msgs

Andreas Ericsson ae at op5.se
Tue May 10 00:40:22 CEST 2005


sean finney wrote:
> hey,
> 
> On Mon, May 09, 2005 at 08:47:42AM -0700, nagios-devel-request at lists.sourceforge.net wrote:
> 
>>From: Andreas Ericsson <ae at op5.se>
> 
> 
>>Zero overhead is just not going to happen. Nagios MUST be able to 
>>execute checks in parallel. It can't do that if it just enters a 
>>function instead without forking, threading or multiplexing (actually it 
>>can't do that without forking or threading, but popen() forks, so to 
>>multiplex the results from it would be a sort of mix of both worlds), as 
>>that would imply a serialized execution.
> 
> 
> you have a point that there's going to need to be some kind of fork
> or multi-threading capabilities.  but calling a function in a forked
> process or thread would still be much better performance-wise than the
> multiple fork and exec calls in the current implementation.  
> 

Yes, but not by the ridiculous amounts your testcase showed, which is 
why this is a questionable approach wrt the amount of work required to 
make it work (two ends, plugins and core must be fixed).

> 
> 
>>It would require a huge re-design of current arch. It would also require 
>>a huge re-design of most plugins, since they don't clean up after 
>>themselves as it is today. They also use very shoddy function-calls. Not 
> 
> 
> that wouldn't be as much of a "redesign" as it would be a code-cleanup,
> which is never a bad thing to do anyway.  plus, what i'm suggesting
> isn't an all-or-nothing switchover, but a conditional switch.  plugins
> could be audited for poor memory management etc and as they are approved
> added to a list of plugins to be added to the shared object target list.
> 

I still don't buy it.

> 
>>to mention; plugins that crash would cause nagios to crash. This just 
>>isn't good enough.
> 
> 
> even forked children?
> 

No, but threads in a worker list that pick up checks as they're ready 
to run would crash nagios. In-process plugins effectively put a stop 
to all such designs.

> 
>>Three fork()'s and two execve()'s, as nagios itself forks once prior to 
>>running popen(). execve() replaces the running process, so there's no 
> 

Actually, I made a mistake there. Only two fork()'s, as /bin/sh calls 
execve() more or less immediately.

> 
> that's the count that i got:
> 
> - nagios forks
> - nagios child calls popen
> - popen forks 
> - popen child calls execve(/bin/sh)
> - /bin/sh forks

Nopes.

> - /bin/sh child calls execve(cmd)
> - /bin/sh child (now cmd) exits with status
> 
> and i'm suggesting
> 
> - nagios forks

nagios child does a symbol lookup of plugin_function

> - nagios child calls plugin_function
> - nagios child exits return status of plugin_function
> 
> note that if this were in a multi-threaded arch, or if the child
> processes were pre-allocated, even this fork would have a negligible
> effect.
> 

No more negligible than what's already the case. I'm not sure what you 
mean by multithreaded arch (multithreaded design, SMP machine?).

> 
>>running popen(). execve() replaces the running process, so there's no 
>>context-switching. It would be possible to get rid of one of the 
> 
> 
> assuming that one fork() isn't avoidable, you still have three processes
> between which you have to switch in the popen approach (nagios child,
> popen child, /bin/sh child).
> 

nagios -> fork("pure" child) -> fork (popen) -> execve(/bin/sh -c) -> 
execve(plugin).

Nothing else. I was wrong in my previous assessment.
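
For illustration, the popen() leg of that chain can be sketched roughly as
below. This is a minimal sketch, not code from Nagios: run_check_popen() is
a hypothetical helper name, and it assumes the caller only wants the first
line of output plus the exit status.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical helper: run cmd via popen(), which hands it to /bin/sh -c.
 * Copies the first line of output into buf (newline stripped) and returns
 * the command's exit status, or -1 if it could not be run or was killed. */
int run_check_popen(const char *cmd, char *buf, size_t buflen)
{
    FILE *fp;
    int status;
    char scratch[256];

    if (buflen == 0)
        return -1;
    buf[0] = '\0';

    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    if (fgets(buf, (int)buflen, fp) != NULL)
        buf[strcspn(buf, "\n")] = '\0';

    /* Drain the rest so the child is not killed by SIGPIPE on a late write. */
    while (fgets(scratch, sizeof scratch, fp) != NULL)
        ;

    status = pclose(fp);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_check_popen("echo OK; exit 2", buf, sizeof buf) would leave "OK"
in buf and return 2.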

> 
>>Arguments can contain whitespace if escaped or enclosed in strings. Do 
>>you feel like writing a function that does that and that's fast enough 
>>to run as often as is required, while still being rock-solid safe? The 
>>functions that do this in glibc and bash are asm-enhanced and 
>>fine-tuned for each architecture they run on. You'd increase load 
>>drastically, not reduce it.
> 
> 
> okay, so a little trickier than splitting on whitespace.  however, i
> don't see where your concerns about speed/efficiency are coming from.
> why would we need to do this every time the command is executed?  why
> not parse the cmd into arguments when the command is first read in from
> the conffile?

True. Didn't think of that. But with that in mind it might be better to do
nagios -> fork("pure" child) -> execve()
which would also save us the (currently not) superfluous fork()'s. This 
is a trick question. Or is it? ;)
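
A rough sketch of what "parse once at config-load time" could look like
follows. It is an assumption-laden illustration, not Nagios code:
split_command() and free_command() are hypothetical names, and the quoting
rules are deliberately simpler than a real shell's (single quotes are
literal, backslash escapes the next character elsewhere, and there is no
variable expansion or globbing). An unterminated quote simply runs to the
end of the line.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Free a vector returned by split_command(). */
void free_command(char **argv)
{
    size_t i;

    if (argv == NULL)
        return;
    for (i = 0; argv[i] != NULL; i++)
        free(argv[i]);
    free(argv);
}

/* Hypothetical helper: split a command line into a NULL-terminated,
 * malloc'd argv vector. Returns NULL on allocation failure. */
char **split_command(const char *line, int *argc_out)
{
    size_t len = strlen(line);
    /* Tokens are separated by whitespace, so len/2 + 1 tokens at most,
     * plus one slot for the terminating NULL. */
    char **argv = calloc(len / 2 + 2, sizeof *argv);
    char *tok = malloc(len + 1);
    const char *p = line;
    int argc = 0;

    if (argv == NULL || tok == NULL)
        goto fail;

    for (;;) {
        size_t n = 0;
        char quote = 0;

        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        for (; *p != '\0'; p++) {
            if (quote == '\'') {
                if (*p == '\'')
                    quote = 0;
                else
                    tok[n++] = *p;
            } else if (*p == '\\' && p[1] != '\0') {
                tok[n++] = *++p;        /* escaped char taken literally */
            } else if (quote == '"') {
                if (*p == '"')
                    quote = 0;
                else
                    tok[n++] = *p;
            } else if (*p == '\'' || *p == '"') {
                quote = *p;
            } else if (isspace((unsigned char)*p)) {
                break;                  /* end of this argument */
            } else {
                tok[n++] = *p;
            }
        }

        argv[argc] = malloc(n + 1);
        if (argv[argc] == NULL)
            goto fail;
        memcpy(argv[argc], tok, n);
        argv[argc][n] = '\0';
        argc++;
    }

    free(tok);
    *argc_out = argc;
    return argv;

fail:
    free(tok);
    free_command(argv);
    return NULL;
}
```

The vector would be built once when the command definition is read and then
reused for every execution, so the parsing cost stays out of the check loop.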

>  plus, if we did that regardless of this dlopen suggestion,
> we could also cut out the popen call and just do fork/exec/dup on the
> actual command using the same argument list.
> 
> 
>>A way around this would be to rewrite the plugins more or less from 
>>scratch, and possibly make them simpler as well, while tagging them for 
>>nagios to KNOW which ones are expected to have modules installed. For 
>>instance, the check_command could look something like
> 
> 
> i don't see what this gets anyone, apart from more work to accomplish
> effectively the same task.
> 

Have you seen the state of the plugins today? There's a _LOT_ of shoddy 
code running around in there (many plugins popen() by the way).

> 
>>popen() is fork() + dup() + execve(), more or less. Read glibc-2.3.5 
> 
> 
> popen is fork + dup + exec (/bin/sh -c) + fork + dup + exec (your command).
> 

sh doesn't fork. It just calls execve() to overwrite itself with the new 
process (there's no call named exec, btw), and dup() isn't used, it's 
pipe() (my bad from the beginning), so

popen() is pipe() + fork() + execve() + execve()

according to glibc-2.3.5/libio/iopopen.c
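
For comparison, skipping the shell entirely could look roughly like this.
It is a sketch under assumptions, not the glibc or Nagios implementation:
run_check_exec() is a hypothetical name, argv[0] is taken to be a full
path, and the child uses dup2() to attach the pipe to its stdout.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: pipe() + fork() + execv(), with no /bin/sh between.
 * Stores the first line of the child's stdout in buf and returns its exit
 * status, 127 if the exec failed, or -1 on other errors. */
int run_check_exec(char *const argv[], char *buf, size_t buflen)
{
    int fds[2], status;
    pid_t pid;
    size_t used = 0;
    ssize_t n;
    char scratch[256];

    if (buflen == 0 || pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: make the write end of the pipe our stdout, then exec. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execv(argv[0], argv);
        _exit(127);                     /* only reached if execv() failed */
    }

    /* Parent: read until EOF, keeping only what fits in buf. */
    close(fds[1]);
    while ((n = read(fds[0], scratch, sizeof scratch)) > 0) {
        size_t room = buflen - 1 - used;
        size_t take = (size_t)n < room ? (size_t)n : room;

        memcpy(buf + used, scratch, take);
        used += take;
    }
    close(fds[0]);
    buf[used] = '\0';
    buf[strcspn(buf, "\n")] = '\0';     /* keep the first line only */

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

This only works with an argument vector that has already been split, since
there is no shell left to do the word splitting.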

> (and later, in another mail)
> 
> 
>>Sean, it'd be interesting to compare this test with the dlopen() idea of 
>>yours. Make it time itself so that timing starts after dlopen() and each 
>>command just requires a table lookup, symbol lookup, fork, execution and 
>>collection.
> 
> 
> this wouldn't be the best example of a test, because executing /bin/ls
> is doing a fork/exec, which is exactly what the idea is trying to avoid.

Your test was a bit rubbish (sorry, but it was) because it didn't 
provide the most important part, which is that of parallelisation. 
Serialized internal function execution is quite obviously a helluva lot 
faster than running popen() on an external program. Everybody knew that, 
nobody has argued against it, and no one ever will.

> granted, many plugins do this internally too.  however, i would be
> interested to see how the dlopen approach handled itself within a
> multithreaded environment.  
> 
> i'll see about grabbing a smaller plugin (such as the check_rand you
> mention) and testing that.  i'll also try and make it more realistic
> to what goes on inside nagios (calling the function from a child).
> 
> (and yet later)
> 
> 
>>You really need to use a real plugin while doing this test since it's 
>>quite obvious that calling a function that does nothing is a lot faster 
>>as an in-core function than as an external program. My experiment showed 
>>benchmarks between two different ways of executing external programs, so 
>>it's ok for that to use nonsense data that returns quickly, while yours 
>>focuses on the entire execution cycle from execution-start to 
>>execution-end. You really need to do something a bit more real to 
>>investigate the time gained for that (hint, the most time is spent in 
>>the plugin).
> 
> 
> actually, last night i started with check_tcp and saw a similar trend
> (though not the 4 orders of magnitude seen here).  the reason i didn't
> post that was posting instructions on how to properly build check_tcp
> as a shared object was slightly more complicated.
> 

But you were still executing checks in serial. To increase performance 
you need to be able to execute a lot of checks within a small window of 
time. Each check can take as long as it likes, so long as it finishes 
within that timeframe.
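
The difference can be illustrated with a toy timing harness. This is only a
sketch: run_fake_checks() is a hypothetical name, and each "check" is just a
forked child that sleeps for check_ms milliseconds, standing in for a plugin
that spends its time waiting on the network.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical harness: fork `count` children that each sleep check_ms
 * milliseconds. With parallel == 0 each child is reaped before the next
 * one starts; otherwise all are started first and reaped afterwards.
 * Returns the elapsed wall-clock time in milliseconds. */
long run_fake_checks(int count, long check_ms, int parallel)
{
    struct timespec start, end, nap;
    int i, running = 0;

    nap.tv_sec = check_ms / 1000;
    nap.tv_nsec = (check_ms % 1000) * 1000000L;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid == -1)
            break;                      /* stop launching on failure */
        if (pid == 0) {
            nanosleep(&nap, NULL);      /* the "check" itself */
            _exit(0);
        }
        if (parallel)
            running++;                  /* reap later */
        else
            waitpid(pid, NULL, 0);      /* serial: wait right away */
    }
    while (running-- > 0)
        wait(NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (long)(((long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                   + (end.tv_nsec - start.tv_nsec)) / 1000000LL);
}
```

With four 100 ms checks, the serial run cannot finish in under 400 ms, while
the parallel run should take little more than a single check, whatever the
per-invocation overhead is.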

> 
>>to fire up new checks as old ones complete either (err, that's what it 
>>does, but in a serial manner), while both mthread and mplex can be 
>>modified only slightly to do just that and thus scale far better.
> 
> 
> i think you and i are barking up two slightly different trees here.
> 

We're both discussing performance improvements in nagios, right?

> what i've been trying to argue is that checks via functions will prove
> to be much better performing than executing plugins via fork/exec
> or popen.

Which is a tiny, tiny gain (< 20ms / check) for a LOT of hacking.

>  sure, a multithreaded architecture will also yield better
> results (even more so, but also more work to overhaul), but that's kind of
> orthogonal to what i'm getting at.  
> 

If plugins are to be run in serial you can save all the millisecs you 
want on each invocation and performance will still suck. Your model of 
dlopen()'ing plugins MUST be coupled with some form of multithreading 
mechanism (either fork() or pthread_create()) which makes the entire 
notion somewhat ridiculous for plugins that are used less frequently 
than say once every five seconds. The only plugins I can think of that 
run more often than that would be a ping check and possibly interface 
related snmp-checks for switches and routers in really huge networks.

I'd be happy to revise my opinion on this if you were to provide a PoC 
that shows that dlopen() works significantly (5% or more) faster than 
the popen() approach while running checks in parallel and reaping the 
results in a satisfactory manner (1 line of output, no leaks, get 
return-value). It would also be nice if the current level of protection 
remains (plugins can sigsegv without affecting the main process). So far 
I've seen nothing that even comes close to this.
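
As a point of reference for the protection requirement, here is a minimal
sketch of calling a check function inside a forked child so that a crash
stays contained. Everything named here is hypothetical: the check_fn
pointer stands in for whatever dlsym() would return, check_ok() and
check_crash() are dummy plugins, and mapping a signal death to 3 follows
the usual plugin convention for UNKNOWN.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a plugin entry point obtained from a shared object. */
typedef int (*check_fn)(void);

/* Dummy plugin that succeeds. */
int check_ok(void)
{
    return 0;
}

/* Dummy plugin that dies from SIGSEGV, like a buggy plugin would. */
int check_crash(void)
{
    raise(SIGSEGV);
    return 0;
}

/* Hypothetical helper: run fn in a forked child and return its result.
 * A child killed by a signal is reported as 3 (UNKNOWN); the calling
 * process itself is unaffected. Returns -1 if fork/wait fails. */
int run_check_isolated(check_fn fn)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(fn() & 0xff);             /* child: result becomes exit code */

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (WIFSIGNALED(status))
        return 3;                       /* plugin crashed */
    return WEXITSTATUS(status);
}
```

The same function called from a thread in the main process would take the
whole process down with it, which is the case the fork guards against.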

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

