RFC/RFP Nagios command workers

Andreas Ericsson ae at op5.se
Wed May 18 16:43:51 CEST 2011


Ahoy again.

Since discussion on the last requests for comments and patches has
splintered off and gotten somewhere, it's time for the next mail in
the series on what we awesome gods of the Nagios core have decided to
work on for the next grand version of Nagios.

This idea comes from Shinken, mod_gearman and DNX which have all
implemented versions of it, so creds and kudos to the authors of
those projects.

Currently, Nagios eats quite a lot of I/O when writing, scanning for
and reading the check result files. This becomes especially noticeable
in large installations. There's also the problem of Nagios using a
lot more copied memory per fork than it's supposed to, and the fact
that embedding scripting languages inside the Nagios core to speed
up execution is a potentially disastrous action (as the debacle with
embedded Perl has proven to be).

The idea to solve all of that is to fork() off a set of worker
processes at startup that free() all possible memory and reconnect
to the master process via a Unix domain socket (or a network socket
that by default listens only on the localhost address) to receive
requests to run commands and return the results of those commands.
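
To make the idea a bit more concrete, here's a rough sketch of the
master/worker round trip. It is Python rather than the C that Nagios
is written in, and everything in it is an assumption made up for
illustration: the names worker_loop(), spawn_worker() and run_check(),
the one-JSON-object-per-line wire format, and the use of socketpair()
instead of a named socket the worker reconnects to. It also skips the
"free all possible memory" step, which only makes sense in C.

```python
import json
import os
import socket
import subprocess
import sys


def worker_loop(sock):
    """Worker side: read one JSON job per line, run it, reply with one JSON line."""
    reader = sock.makefile("r", encoding="utf-8")
    for line in reader:
        job = json.loads(line)
        try:
            proc = subprocess.run(job["argv"], capture_output=True, text=True)
            result = {"id": job["id"], "rc": proc.returncode, "output": proc.stdout}
        except OSError as exc:  # e.g. the plugin does not exist
            result = {"id": job["id"], "rc": 127, "output": str(exc)}
        sock.sendall((json.dumps(result) + "\n").encode("utf-8"))


def spawn_worker():
    """Master side: fork one worker and return (pid, master end of the socket)."""
    master_end, worker_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: drop the master's end, then serve jobs until EOF.
        master_end.close()
        try:
            worker_loop(worker_end)
        finally:
            os._exit(0)  # never fall back into the master's code path
    worker_end.close()
    return pid, master_end


def run_check(sock, job_id, argv):
    """Send one job and block until its result line comes back."""
    sock.sendall((json.dumps({"id": job_id, "argv": argv}) + "\n").encode("utf-8"))
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("worker closed the connection")
        buf += chunk
    return json.loads(buf)


if __name__ == "__main__":
    pid, sock = spawn_worker()
    print(run_check(sock, 1, [sys.executable, "-c", "print('OK - all good')"]))
    sock.close()        # EOF tells the worker to exit
    os.waitpid(pid, 0)  # reap it
```

Note that no check result file is ever written here: the result
travels back over the same socket the request came in on. A real
master would of course multiplex many workers with select() or
similar instead of blocking on one.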

This has several benefits, although they're not immediately
user-visible.
* I/O load will decrease significantly, leaving more disk throughput
  capacity for performance data graphing or status data database
  solutions.
* Scripting languages can be embedded regardless of memory leaks and
  whatnot, since worker daemons can be killed off and respawned every
  50000 checks (or something), thus causing the kernel to clean up
  any and all leaked memory.
* Nagios core can be single-threaded, which means higher portability,
  less memory usage and more robust code.
* Eventbroker modules that use a socket to communicate with an external
  daemon can instead register a handler for inbound packets and then
  simply "own" that connection and get all future packets from it
  forwarded as eventbroker events. This will of course reduce the module
  complexity quite a bit for nearly all much-used modules today (Merlin,
  livestatus, DNX, mod_gearman, NDOUtils, etc.)
* It becomes possible to receive responses from Nagios when submitting
  commands (the current FIFO pipe is one-way communication only).
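
The "kill off and respawn" point is perhaps easiest to see in a small
sketch. Again this is Python and purely illustrative: the class
RecyclingWorker, the helper names and the tiny MAX_JOBS value are all
made up, and the "leak" is simulated with a list that only ever grows.
Each reply is the number of jobs the current worker process is still
holding on to.

```python
import os
import socket

MAX_JOBS = 2  # tiny for demonstration; the mail suggests something like 50000


def _recv_line(sock):
    """Read from sock until a newline arrives; return the line without it."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf[:-1].decode("utf-8")


def _leaky_worker(sock):
    """Stand-in for a leaky embedded interpreter: never releases anything."""
    leaked = []
    while True:
        try:
            job = _recv_line(sock)
        except ConnectionError:
            return  # master closed its end: time to exit
        leaked.append(job * 1000)  # simulated leak, grows with every job
        sock.sendall(f"{len(leaked)}\n".encode("utf-8"))


class RecyclingWorker:
    """Master-side handle that replaces its worker every max_jobs jobs."""

    def __init__(self, max_jobs=MAX_JOBS):
        self.max_jobs = max_jobs
        self._spawn()

    def _spawn(self):
        self.sock, child_end = socket.socketpair()
        self.pid = os.fork()
        if self.pid == 0:
            self.sock.close()
            try:
                _leaky_worker(child_end)
            finally:
                os._exit(0)
        child_end.close()
        self.jobs_sent = 0

    def close(self):
        self.sock.close()        # EOF makes the worker return and exit
        os.waitpid(self.pid, 0)  # reap it; the kernel reclaims all its memory

    def run(self, job):
        if self.jobs_sent >= self.max_jobs:
            self.close()
            self._spawn()
        self.sock.sendall((job + "\n").encode("utf-8"))
        self.jobs_sent += 1
        return int(_recv_line(self.sock))  # jobs held by the current worker


if __name__ == "__main__":
    worker = RecyclingWorker()
    print([worker.run("check_something") for _ in range(5)])
    worker.close()
```

With MAX_JOBS at 2, five consecutive run() calls return 1, 2, 1, 2, 1:
the count starts over every time a worker is replaced, because
whatever the old process leaked went away with it.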

Drawbacks:
* It's quite a large and invasive change to the Nagios core which
  will require a lot of testing.

I know some people I met in Italy have already volunteered to help
implement and test this (Hi Cheik), but it would definitely be
helpful to get feedback from module authors and users when making this
change to Nagios.

Please note that a compatibility daemon which continues to parse the
simple FIFO will of course have to be implemented so that current scripts
and whatnot keep on working. The API to scan for and read check
result files will also remain for the foreseeable future, although
possibly implemented as an external helper program which ships
check results into the Nagios socket instead.
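
As a rough idea of what the FIFO side of such a compatibility helper
could look like, here's a Python sketch. The input format is the
existing external command syntax, "[time] COMMAND;arg;arg". Everything
else is an assumption for illustration: the function names, and above
all the JSON-per-line format shipped to the socket, since the real
socket protocol hasn't been designed yet. The argument split is also
naive, since it would cut plugin output that itself contains
semicolons.

```python
import json
import re
import socket

# Classic external command line: "[<unix time>] <COMMAND>;<arg>;<arg>..."
_CMD_RE = re.compile(r"^\[(\d+)\]\s+([A-Z0-9_]+)(?:;(.*))?$")


def parse_external_command(line):
    """Split one FIFO line into a dict, or return None if it is malformed."""
    match = _CMD_RE.match(line.strip())
    if match is None:
        return None
    timestamp, name, args = match.groups()
    return {
        "time": int(timestamp),
        "command": name,
        # Naive split: breaks arguments that contain ';' themselves.
        "args": args.split(";") if args is not None else [],
    }


def forward_commands(lines, sock):
    """Parse each line and ship the valid ones to sock; return how many were sent."""
    sent = 0
    for line in lines:
        command = parse_external_command(line)
        if command is None:
            continue  # a real helper would log the rejected line
        sock.sendall((json.dumps(command) + "\n").encode("utf-8"))
        sent += 1
    return sent


if __name__ == "__main__":
    # socketpair() stands in for the connection to the Nagios socket.
    helper_end, nagios_end = socket.socketpair()
    forward_commands(
        ["[1305729831] SCHEDULE_FORCED_SVC_CHECK;host1;PING;1305729831\n"],
        helper_end,
    )
    helper_end.close()
    print(nagios_end.makefile("r", encoding="utf-8").read(), end="")
```

In a real helper, the lines argument would simply be the opened FIFO,
so existing scripts that write to it would never notice the change.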

Comments, patches and (before summer's out) testing are very much
appreciated.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
