RFC/RFP Nagios command workers

Matthieu Kermagoret mkermagoret at merethis.com
Tue Jun 28 17:13:56 CEST 2011


Hi list,

First of all, sorry for the delayed response; last month was pretty
crazy at work :-p

On Mon, May 23, 2011 at 12:38 PM, Andreas Ericsson <ae at op5.se> wrote:
> On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
> Because shipping an official module that does it would mean not only
> supporting the old complexity, but also the new one. Having a single
> default system for running checks would definitely be preferable to
> supporting multiple ones.
>

I agree that a single system is better than two. However, I fear that
the worker system would need much more code than a simpler system (and
less code usually means fewer bugs) and that it would destabilize
Nagios. For years it has been the Nagios development team's policy not
to include features that could be written as modules. I liked it that
way.

>> 1) Remove the multiple fork system to execute a command. The Nagios
>> Core process directly forks the process that will exec the command
>> (more or less sh's parsing of command line, don't really know if this
>> could/should be integrated in the Core).
>>
>
> This really can't be done without using multiple threads since the
> core can't wait() for input and children while at the same time
> issuing select() calls to multiplex the new output of currently
> running checks.
>

What about a signal handler on SIGCHLD that would wait() for terminated
processes, plus a select() on the pipe FDs connected to child
processes, with a timeout to kill non-responding checks?
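
To make the idea concrete, here is a rough sketch of such a loop. It is
written in Python for brevity (the Core itself is C) and is only an
illustration under my own assumptions, not a proposed patch: the names
reap_children, spawn and run_checks, the single per-check timeout and
the 4096-byte read size are all hypothetical.

```python
import os
import select
import signal
import time

exit_status = {}  # pid -> exit code, or negative signal number if killed


def reap_children(signum, frame):
    """SIGCHLD handler: wait() for every terminated child, never blocking."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no child processes at all
        if pid == 0:
            return  # children remain, but none has terminated yet
        if os.WIFSIGNALED(status):
            exit_status[pid] = -os.WTERMSIG(status)
        else:
            exit_status[pid] = os.WEXITSTATUS(status)


def spawn(argv):
    """Fork once and exec argv directly, with its stdout tied to a pipe."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        try:
            os.dup2(wfd, 1)
            os.close(rfd)
            os.close(wfd)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    os.close(wfd)  # parent keeps only the read end
    return pid, rfd


def run_checks(commands, timeout=5.0):
    """Run every argv in commands; return {pid: (output or None, status)}."""
    previous = signal.signal(signal.SIGCHLD, reap_children)
    try:
        running = {}  # read fd -> (pid, deadline, output chunks)
        output = {}   # pid -> bytes, or None when the check timed out
        for argv in commands:
            pid, rfd = spawn(argv)
            running[rfd] = (pid, time.monotonic() + timeout, [])
        while running:
            nearest = min(deadline for _, deadline, _ in running.values())
            wait = max(0.0, nearest - time.monotonic())
            readable, _, _ = select.select(list(running), [], [], wait)
            for fd in readable:
                pid, _, chunks = running[fd]
                data = os.read(fd, 4096)
                if data:
                    chunks.append(data)
                else:  # EOF: the child closed its end of the pipe
                    os.close(fd)
                    del running[fd]
                    output[pid] = b"".join(chunks)
            now = time.monotonic()
            for fd, (pid, deadline, _) in list(running.items()):
                if now >= deadline:  # non-responding check: kill it
                    try:
                        os.kill(pid, signal.SIGKILL)
                    except ProcessLookupError:
                        pass  # already gone and reaped by the handler
                    os.close(fd)
                    del running[fd]
                    output[pid] = None
        while any(pid not in exit_status for pid in output):
            time.sleep(0.01)  # let the handler reap the last children
        return {pid: (out, exit_status.pop(pid)) for pid, out in output.items()}
    finally:
        signal.signal(signal.SIGCHLD, previous or signal.SIG_DFL)
```

Calling run_checks([["echo", "OK"], ["sleep", "30"]], timeout=1.0)
should return the output of the first command with its exit code and
mark the second one as timed out. The handler only reaps; all output
handling stays in the select() loop, which is the part a real
implementation would have to get right.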

>> 2) The root process and the subprocess are connected with a pipe() so
>> that the command output can be fetched by reading the pipe. Nagios
>> will maintain a list of currently running commands.
>>
>
> Pipes are limited in that they only guarantee 512 bytes of atomic
> writes and reads. TCP sockets don't have this problem. There's also

My understanding of POSIX is that the core standard sets a 512-byte
minimum for atomic I/O operations on pipes, but I cannot find any
section that guarantees atomicity for I/O operations on TCP sockets, so
pipes would actually be the better choice. Were you referring to the
XSI Streams, or could you point me to the appropriate section?
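
For reference, the pipe limit can be queried at run time. The short
Python sketch below is only an illustration; the helper names
pipe_atomic_write_limit and fits_in_atomic_write are mine, and I assume
a platform where os.fpathconf() knows the "PC_PIPE_BUF" name (Linux and
the BSDs do).

```python
import os

# _POSIX_PIPE_BUF: the smallest value POSIX allows PIPE_BUF to take.
POSIX_PIPE_BUF_MIN = 512


def pipe_atomic_write_limit():
    """Return PIPE_BUF for a freshly created pipe on this system."""
    rfd, wfd = os.pipe()
    try:
        return os.fpathconf(wfd, "PC_PIPE_BUF")
    finally:
        os.close(rfd)
        os.close(wfd)


def fits_in_atomic_write(payload, limit=None):
    """True if payload is small enough for a single atomic pipe write."""
    if limit is None:
        limit = pipe_atomic_write_limit()
    return len(payload) <= limit
```

Code that must stay portable can only count on POSIX_PIPE_BUF_MIN;
anything above that is a property of the local system.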

> the fact that a lot of modules already use sockets, so we can get
> rid of a lot of code in those modules and let them re-use Nagios'
> main select() loop and get inbound events on "their" sockets as a
> broker callback event. Much neater that way.
>

A pretty API would definitely be great, no doubt.

>> 3) The event loop will multiplex processes' I/O and process them as necessary.
>>
>
> That's what the worker processes will do and then feed the results
> back to the nagios core through the sequential socket, which will
> guarantee read and write operations large enough to never truncate
> any of the data necessary for the master process to do proper book-
> keeping.
>

I'm not very fond of the "large buffer" approach because I'm not
really sure that I/O operations are atomic (see above).
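
One way to avoid depending on atomicity at all would be to frame each
result explicitly and let the reader loop until a whole message has
arrived. This is not what was proposed in this thread, just a sketch of
an alternative in Python; the 4-byte length prefix and the names
send_result, recv_exact and recv_result are assumptions of mine.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def send_result(sock, payload):
    """Send one length-prefixed message; sendall() loops on short writes."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly size bytes, however many recv() calls that takes."""
    chunks = []
    while size:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_result(sock):
    """Receive one complete message previously sent with send_result()."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

With this kind of framing, two results written back to back on a
socket.socketpair() come out as two separate messages on the other end,
no matter how the kernel splits or merges the underlying reads.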

>> The worker system could still be implemented and used only by users
>> who need it (but that's what DNX and mod_gearman do). I believe it is
>> better to leave the default command execution system as simple as it
>> is right now (but improve it) and leave distribution algorithms to
>> modules. I can imagine multiple reasons for which one would want to
>> distribute checks among workers:
>>
>
> The only direction we can improve it is to remove it and rebuild it.
> Removing one fork() call would mean implementing the multiplexer
> anyway, and once that's done we're 95% done with the worker process
> code anyways.
>

I guess that you shouldn't be far from a complete worker system but
(again :-) ) writing it as a module wouldn't be difficult either.

>> So in fact you plan on removing the old FIFO and doing everything through
>> the socket? What about acknowledgements or downtimes? Could they be
>> sent through the socket too or would there be another system?
>>
>
> They could be sent through the socket, but for a while at least we'll
> have to support both pipe and socket, so addon developers have some
> time to adjust to the new world order.
>

All right, thanks for the explanations. A status returned by Nagios
after any external command execution would be a nice feature indeed.

Best regards,

-- 
Matthieu KERMAGORET | Developer

mkermagoret at merethis.com

MERETHIS is the publisher of the Centreon software.

_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel