[Nagios-devel] RFC/RFP Nagios command workers

Andreas Ericsson ae at op5.se
Mon May 23 12:38:04 CEST 2011


On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
> 
>> The idea to solve all of that is to fork() off a set of worker
>> threads at startup that free()'s all possible memory and re-connects
>> to the master process via a unix domain socket (or network socket
>> that by default only listens to the localhost address) to receive
>> requests to run commands and return the results of those commands.
>>
> 
> While I agree that distributing check execution among multiple
> processes can be a really good idea, I don't know if this should be
> implemented in the Core. This can add significant complexity to the
> code while not being useful to all Nagios users. The Core already has
> a proper API that allows modules to execute checks themselves, so why
> not rely on it for distribution and improve the existing command
> execution mechanism?
> 

Because shipping an official module that does it would mean supporting
not only the old complexity but also the new one. Having a single
default system for running checks would definitely be preferable to
supporting multiple ones.

> As you say, one of the root problems of the current implementation is
> the use of temporary files, as this consumes a lot of I/O when
> writing, scanning and reading them. Also, the Nagios Core process is
> fork()ed multiple times and this might consume unnecessary CPU time.
> So I propose the following:
> 
> 1) Remove the multiple-fork system for executing a command. The Nagios
> Core process directly forks the process that will exec the command
> (more or less sh's parsing of the command line; I don't really know if
> this could/should be integrated in the Core).
> 

This really can't be done without using multiple threads, since the
core can't block in wait() reaping children while at the same time
issuing select() calls to multiplex the output of currently
running checks.


> 2) The root process and the subprocess are connected with a pipe() so
> that the command output can be fetched by reading the pipe. Nagios
> will maintain a list of currently running commands.
> 

Pipes are limited in that POSIX only guarantees atomic writes of up
to 512 bytes; larger messages can get interleaved. TCP sockets don't
have this problem. There's also the fact that a lot of modules
already use sockets, so we can get rid of a lot of code in those
modules and let them re-use Nagios' main select() loop, getting
inbound events on "their" sockets as broker callback events. Much
neater that way.

> 3) The event loop will multiplex processes' I/O and process them as necessary.
> 

That's what the worker processes will do, feeding the results back
to the nagios core through the sequential socket, which guarantees
read and write operations large enough to never truncate any of the
data the master process needs to do proper bookkeeping.

>> This has several benefits, although they're not immediately user
>> visible.
>> * I/O load will decrease significantly, leaving more disk throughput
>>   capacity for performance data graphing or status data database
>>   solutions.
> 
> Still holds but to a smaller extent, as the "problem of Nagios using a
> lot more copied memory per fork than it's supposed to" is not solved.
> This could be solved with a module however, see below.
> 

Not without the module also running external programs, which just means
more complexity inside the nagios core instead of less.

>> * Scripting languages can be embedded regardless of memory leaks and
>>   whatnot, since worker daemons can be killed off and respawned every
>>   50000 checks (or something), thus causing the kernel to clean up
>>   any and all leaked memory.
> 
> There could be modules that override checks and forward them to
> interpreter daemons on a per-language basis for example.
> 

Yup. I'd expect this to be a natural progression of how things work,
with Python being first in line to be embedded.

>> * Nagios core can be single-threaded, which means higher portability,
>>   less memory usage and more robust code.
> 
> Still holds.
> 

Nope. It fails for all modules that require constantly poll()'ed
sockets.

>> * Eventbroker modules that use a socket to communicate with an external
>>   daemon can instead register a handler for inbound packets and then
>>   simply "own" that connection and get all future packets from it
>>   forwarded as eventbroker events. This will of course reduce the module
>>   complexity quite a bit for nearly all much-used modules today (Merlin,
>>   livestatus, DNX, mod_gearman, NDOUtils, etc...)
> 
> Still holds, instead of multiplexing on socket FD, multiplex on pipe FD.
> 

Worker processes will multiplex on pipe fd's. Nagios will just poll the
sockets of the workers (and modules) that have connected to it, and
that's basically it.

>> * It becomes possible to receive responses from Nagios when submitting
>>   commands (the current FIFO pipe is one-way communication only).
>>
> 
> See discussion about the command pipe below.
> 
>> Drawbacks:
>> * It's quite a large and invasive change to the nagios core which
>>   will require a lot of testing.
>>
> 
> This would be a less invasive and smaller change but would still
> require testing ;-)
> 
> The worker system could still be implemented and used only by users
> who need it (but that's what DNX and mod_gearman do). I believe it is
> better to leave the default command execution system as simple as it
> is right now (but improve it) and leave distribution algorithms to
> modules. I can imagine multiple reasons for which one would want to
> distribute checks among workers:
> 

The only direction we can improve it is to remove it and rebuild it.
Removing one fork() call would mean implementing the multiplexer
anyway, and once that's done we're 95% of the way to the worker
process code.

>    - less overhead per fork() (the problem you raised)
>    - embedded interpreter (you raised this also)
>    - per network (the worker closest to a node executes its check)
>    - randomly (clustering)
>    - ...
> 

The network thing seems like a pretty obvious extension of this once
it's in place, but most likely using an external program that can
authenticate remote nodes and forward network events to the Nagios
core. Currently, the Merlin daemon does just that, but the module is
overly complex due to the fact that Nagios is multi-threaded and
lacks socket support. The same holds true for dnx and mod_gearman.

> So I don't know if embedding a particular policy within the Core is a
> good thing. I'd rather see an official module (that might be included
> by default) for the workers system.
> 

Again, that would only lead to us having to support two different ways
of running checks. I dislike that intensely.

>> Please note that a compatibility daemon which continues to parse the
>> simple FIFO will of course have to be implemented so that current scripts
>> and whatnot keep on working, and the API to scan for and read check
>> result files will also remain for the foreseeable future, although
>> possibly implemented as an external helper program which can ship
>> check results into the Nagios socket instead.
>>
> 
> So in fact you plan on removing the old FIFO and doing everything
> through the socket? What about acknowledgements or downtimes? Could
> they be sent through the socket too, or would there be another system?
> 

They could be sent through the socket, but for a while at least we'll
have to support both pipe and socket, so addon developers have some
time to adjust to the new world order.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
