RFC/RFP Nagios command workers

Matthieu Kermagoret mkermagoret at merethis.com
Mon May 23 11:37:24 CEST 2011


On Wed, May 18, 2011 at 4:43 PM, Andreas Ericsson <ae at op5.se> wrote:
> Since discussion on the last requests for comments and patches has
> splintered off and gotten somewhere, it's time for the next mail in
> the series of what us awesome gods of the Nagios core decided to
> work on for the next grand version of Nagios.
>

Congratulations! I'm glad to see Nagios' development moving forward!

> This idea comes from Shinken, mod_gearman and DNX which have all
> implemented versions of it, so creds and kudos to the authors of
> those projects.
>
> Currently, Nagios eats quite a lot of I/O when writing, scanning for
> and reading the check result files. This becomes especially noticeable
> in large installations. There's also the problem of Nagios using a
> lot more copied memory per fork than it's supposed to, and the fact
> that embedding scripting languages inside the Nagios core to speed
> up execution is a potentially disastrous action (as the debacle with
> embedded Perl has proven to be).
>

Good analysis, with which I totally agree.

> The idea to solve all of that is to fork() off a set of worker
> threads at startup that free()'s all possible memory and re-connects
> to the master process via a unix domain socket (or network socket
> that by default only listens to the localhost address) to receive
> requests to run commands and return the results of those commands.
>

While I agree that distributing check execution among multiple
processes can be a really good idea, I don't know if this should be
implemented in the Core. This can add significant complexity to the
code while not being useful to all Nagios users. The Core already has
a proper API that allows modules to execute checks themselves, so why
not rely on it for distribution and improve the existing command
execution mechanism?

As you say, one of the root problems of the current implementation is
the use of temporary files, as this consumes a lot of I/O when writing,
scanning and reading them. Also, the Nagios Core process is fork()ed
multiple times and this might consume unnecessary CPU time. So I
propose the following:

1) Remove the multiple fork system to execute a command. The Nagios
Core process directly forks the process that will exec the command
(this means more or less reproducing sh's parsing of the command line;
I don't really know if this could/should be integrated in the Core).

2) The root process and the subprocess are connected with a pipe() so
that the command output can be fetched by reading the pipe. Nagios
will maintain a list of currently running commands.

3) The event loop will multiplex processes' I/O and process them as necessary.
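
The three steps above can be sketched roughly as follows. This is only
an illustration in Python, not Nagios code: run_checks() and its
argument format are hypothetical names, shlex.split() stands in for
"sh's parsing of the command line", and the standard selectors module
plays the role of the event loop. It assumes a Unix-like system, where
pipes can be multiplexed.

```python
import os
import selectors
import shlex
import subprocess


def run_checks(commands):
    """Run all commands concurrently; return {name: (exit_code, output)}.

    `commands` maps a (hypothetical) check name to a command line string.
    """
    sel = selectors.DefaultSelector()
    running = {}   # currently running commands, keyed by pipe fd
    results = {}

    for name, command_line in commands.items():
        # Step 1: one direct fork/exec per command, no intermediate shell.
        proc = subprocess.Popen(shlex.split(command_line),
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL)
        # Step 2: a pipe connects the parent to the child's stdout.
        fd = proc.stdout.fileno()
        running[fd] = (name, proc, bytearray())
        sel.register(proc.stdout, selectors.EVENT_READ, fd)

    # Step 3: the event loop multiplexes the pipes and handles each one
    # only when it has data or has reached end-of-file.
    while running:
        for key, _events in sel.select():
            name, proc, buf = running[key.data]
            chunk = os.read(key.data, 4096)
            if chunk:
                buf.extend(chunk)
                continue
            # Empty read means EOF: the child closed its end of the pipe.
            sel.unregister(key.fileobj)
            key.fileobj.close()
            output = buf.decode(errors="replace").strip()
            results[name] = (proc.wait(), output)
            del running[key.data]

    sel.close()
    return results
```

Calling run_checks({"ping": "echo OK"}) would return a dictionary with
one (exit code, output) pair per check. Note that proc.wait() can block
in this sketch if a child closes its stdout long before exiting; a real
implementation would also need timeouts and stderr handling.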

> This has several benefits, although they're not immediately user
> visible.
> * I/O load will decrease significantly, leaving more disk throughput
>  capacity for performance data graphing or status data database
>  solutions.

Still holds but to a smaller extent, as the "problem of Nagios using a
lot more copied memory per fork than it's supposed to" is not solved.
This could be solved with a module however, see below.

> * Scripting languages can be embedded regardless of memory leaks and
>  whatnot, since worker daemons can be killed off and respawned every
>  50000 checks (or something), thus causing the kernel to clean up
>  any and all leaked memory.

There could be modules that override checks and forward them to
interpreter daemons on a per-language basis for example.
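
The "kill off and respawn" idea quoted above could live in such a
daemon rather than in the Core. As a rough sketch only: Python's
multiprocessing.Pool already recycles a worker process after a fixed
number of tasks through its maxtasksperchild parameter. The names
run_embedded_check() and run_with_recycling() are hypothetical, the
check itself is a placeholder, and the "fork" start method is assumed
(Unix only).

```python
import multiprocessing
import os


def run_embedded_check(check_name):
    """Placeholder for a check run inside an embedded interpreter.

    Returns the check name and the PID of the worker that ran it.
    """
    return check_name, os.getpid()


def run_with_recycling(check_names, workers=1, checks_per_worker=3):
    """Run checks in worker processes that are replaced periodically.

    Each worker exits after `checks_per_worker` checks and a fresh one
    is started, so any memory it leaked is reclaimed by the kernel.
    """
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    with ctx.Pool(processes=workers,
                  maxtasksperchild=checks_per_worker) as pool:
        # chunksize=1 makes every check count as one task for recycling.
        return pool.map(run_embedded_check, check_names, chunksize=1)
```

With one worker and checks_per_worker=3, seven checks would be served
by several successive worker processes, which shows up as different
PIDs in the returned list.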

> * Nagios core can be single-threaded, which means higher portability,
>  less memory usage and more robust code.

Still holds.

> * Eventbroker modules that use a socket to communicate with an external
>  daemon can instead register a handler for inbound packets and then
>  simply "own" that connection and get all future packets from it
>  forwarded as eventbroker events. This will ofcourse reduce the module
>  complexity quite a bit for nearly all much-used modules today (Merlin,
>  livestatus, DNX, mod_gearman, NDOUtils, etc...)

Still holds: instead of multiplexing on socket FDs, multiplex on pipe FDs.

> * It becomes possible to receive responses from Nagios when submitting
>  commands (the current FIFO pipe is one-way communication only).
>

See discussion about the command pipe below.

> Drawbacks:
> * It's quite a large and invasive change to the nagios core which
>  will require a lot of testing.
>

This would be a less invasive and smaller change but would still
require testing ;-)

The worker system could still be implemented and used only by users
who need it (but that's what DNX and mod_gearman do). I believe it is
better to leave the default command execution system as simple as it
is right now (but improve it) and leave distribution algorithms to
modules. I can imagine multiple reasons for which one would want to
distribute checks among workers:

  - less overhead per fork() (the problem you raised)
  - embedded interpreter (you raised this also)
  - per network (the worker closest to a node executes its check)
  - randomly (clustering)
  - ...

So I don't know if embedding a particular policy within the Core is a
good thing. I'd rather see an official module (that might be included
by default) for the workers system.
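
To illustrate why the policy is easier to keep outside the Core, here
is a minimal sketch of two of the policies listed above written as
interchangeable functions. Everything in it is hypothetical: the worker
dictionaries, their "network" key, and the idea of approximating "the
closest worker" by subnet membership are assumptions made for the
example only.

```python
import ipaddress
import random


def pick_random(workers, host_address, rng=random):
    """'Randomly (clustering)': any worker may run the check."""
    return rng.choice(workers)


def pick_by_network(workers, host_address, rng=random):
    """'Per network': prefer a worker whose subnet contains the host.

    Falls back to a random worker when no subnet matches.
    """
    addr = ipaddress.ip_address(host_address)
    for worker in workers:
        if addr in ipaddress.ip_network(worker["network"]):
            return worker
    return pick_random(workers, host_address, rng)
```

A module would then only have to choose which function to call, for
example pick_by_network(workers, "192.168.1.7"), while the Core stays
unaware of the policy in use.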

> Please note that a compatibility daemon which continues to parse the
> simple FIFO will ofcourse have to be implemented so that current scripts
> and whatnot keep on working, and the API to scan for and read check
> result files will also remain for the foreseeable future, although
> possibly implemented as an external helper program which can ship
> check results into the Nagios socket instead.
>

So in fact you plan to remove the old FIFO and do everything through
the socket? What about acknowledgements or downtimes? Could they be
sent through the socket too, or would there be another system?

Best regards,

-- 
Matthieu KERMAGORET | Developer

mkermagoret at merethis.com

MERETHIS is the publisher of the Centreon software.
