Nagios event broker architecture comments

John Calcote john.calcote at gmail.com
Tue Jan 22 23:35:40 CET 2008


To the Nagios developers,

Well, I said in a previous message that I'd write more about external
agents handling checks in Nagios. As a developer on the DNX project, I
can truthfully say that DNX has a vested interest in the general NEB
architecture. While porting DNX to Nagios 3.x, I've found a few
architectural issues that I'd like to bring up.

Nagios makes it possible for a NEB module to override a service check
(by returning CALLBACKOVERRIDE) at a key event that the event broker
publishes while processing that check. Comments in the Nagios code make
it reasonably clear that the Nagios developers intended NEB modules to
use this return value to stop Nagios from processing a check, so that
the module could execute the check and post the results itself.
Needless to say, how the NEB module proceeds once it returns
CALLBACKOVERRIDE matters a great deal to Nagios.
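
A minimal sketch of the override pattern described above, written
against stand-in types so that it compiles on its own. Every name
prefixed with sketch_, the event and return-code values, and the
dispatch_to_remote_agent helper are assumptions made for illustration;
a real module would use the definitions from the Nagios NEB headers.

```c
#include <stddef.h>

/* Stand-ins for this sketch only. Names and values are hypothetical;
 * a real NEB module takes the equivalents from the Nagios headers. */
enum sketch_event { SKETCH_CHECK_INITIATE, SKETCH_CHECK_PROCESSED };
enum sketch_rc    { SKETCH_OK = 0, SKETCH_CALLBACKOVERRIDE = 1 };

struct sketch_check_event {
    enum sketch_event type;
    const char *host_name;
    const char *service_description;
};

/* Counts checks handed to the (imaginary) external agent. */
static int remote_jobs_queued = 0;

/* Hypothetical hand-off: returns 1 if the agent accepted the job. */
static int dispatch_to_remote_agent(const struct sketch_check_event *ev)
{
    if (ev->host_name == NULL || ev->service_description == NULL)
        return 0;
    remote_jobs_queued++;
    return 1;
}

/* Event handler: take over the check only at the INITIATE event and
 * only when the hand-off succeeded. In every other case return the
 * "OK" code so that the check is processed locally as usual. */
int on_service_check(const struct sketch_check_event *ev)
{
    if (ev->type != SKETCH_CHECK_INITIATE)
        return SKETCH_OK;
    if (!dispatch_to_remote_agent(ev))
        return SKETCH_OK;
    return SKETCH_CALLBACKOVERRIDE;
}
```

The point of the fallback branches is that an override is a promise:
once the handler returns the override code, the module alone is
responsible for producing a result for that check.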

If Nagios truly expects a NEB module to handle the processing of a
check and the subsequent submission of check results, then Nagios
needs to make proper results submission possible. I posted an (updated)
patch yesterday that makes some minor changes to
nagios-3.0rc1/base/check.c. While these changes make it POSSIBLE for
DNX to handle a Nagios service check and post proper results, I feel
there are better ways to do it. Even with the patch I submitted,
the DNX NEB module must directly access various global variables in
the Nagios process space.

For one thing, DNX needs data from several fields of the global
check_result_info structure, which is populated just before the broker
INITIATE event is published. These fields are essential to submitting
proper results for a service check. They include check_options, the
schedule and reschedule flags, and the latency value. While both
latency and check_options can be found in the service structure (which
is made available to the event handler), those fields do not contain
the values actually written to the results file during results
submission. The proper values must therefore be accessible to the event
handler so that it can store them for later results submission.
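
One way to picture "store them for later" is a by-value snapshot taken
inside the INITIATE handler. The sketch below assumes a stand-in
structure holding only the four values named above. The struct layout,
the field names scheduled and reschedule, and all sketch_ identifiers
are hypothetical and are not the actual Nagios definitions.

```c
#include <stdio.h>

/* Stand-in for the fields of interest. Hypothetical names; this is
 * not the layout of the real global structure. */
struct sketch_check_result_info {
    int    check_options;
    int    scheduled;    /* "schedule" flag   */
    int    reschedule;   /* "reschedule" flag */
    double latency;
};

/* Per-check record kept by the module until the remote result arrives. */
struct sketch_pending_job {
    char host_name[64];
    char service_description[64];
    struct sketch_check_result_info saved;
};

/* Called from the INITIATE handler. Copies the live values by value,
 * because the global structure is reused for the next check. Names
 * longer than 63 characters are truncated. */
void sketch_snapshot_at_initiate(struct sketch_pending_job *job,
                                 const char *host_name,
                                 const char *service_description,
                                 const struct sketch_check_result_info *live)
{
    snprintf(job->host_name, sizeof job->host_name, "%s", host_name);
    snprintf(job->service_description, sizeof job->service_description,
             "%s", service_description);
    job->saved = *live;
}
```

Copying rather than keeping a pointer is the essential choice here: the
saved values must stay stable after the global structure has been
overwritten by a later check.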

Another issue that I feel is important to address is the global symbol
space to which DNX and other NEB modules have access. DNX uses a
couple of "helper" functions provided by Nagios (unintentionally, I'm
sure): escape_newlines and move_check_result_to_queue, both during
results posting. This saves DNX from duplicating a LOT of Nagios code
while submitting service check results. (These functions were really
great finds!) However, since posting results to Nagios is a critical
bit of functionality, both for DNX and for Nagios, it would be nice if
Nagios provided an actual API call that was published as part of the
NEB interface documentation. If a NEB module intends to use
CALLBACKOVERRIDE, it should be able to publish the results of an
overridden service check in exactly the same way that Nagios does.
This is easily achieved by having Nagios use the same API function to
publish results from checks that Nagios itself executes.
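
The "one function for both paths" idea can be sketched as follows. All
names here are hypothetical, the key=value record is an invented
format rather than the real results file format, and
sketch_escape_newlines is a simplified stand-in, not the Nagios
escape_newlines helper.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical result record for this sketch. */
struct sketch_result {
    const char *host_name;
    const char *service_description;
    int         return_code;
    double      latency;
    const char *output;
};

/* Copy src into dst, writing each newline as the two characters
 * backslash + 'n' and each backslash as two backslashes. Always
 * NUL-terminates when cap > 0; overlong input is truncated. */
static void sketch_escape_newlines(char *dst, size_t cap, const char *src)
{
    size_t o = 0;
    if (cap == 0)
        return;
    for (; *src != '\0' && o + 2 < cap; src++) {
        if (*src == '\n')      { dst[o++] = '\\'; dst[o++] = 'n';  }
        else if (*src == '\\') { dst[o++] = '\\'; dst[o++] = '\\'; }
        else                   { dst[o++] = *src; }
    }
    dst[o] = '\0';
}

/* The single published entry point: formats one result record into
 * buf and returns what snprintf returns. */
int sketch_submit_result(char *buf, size_t cap, const struct sketch_result *r)
{
    char escaped[256];
    sketch_escape_newlines(escaped, sizeof escaped, r->output);
    return snprintf(buf, cap,
                    "host_name=%s\nservice_description=%s\n"
                    "return_code=%d\nlatency=%.3f\noutput=%s\n",
                    r->host_name, r->service_description,
                    r->return_code, r->latency, escaped);
}

/* Path 1: a check that the core executed itself. */
int native_check_finished(char *buf, size_t cap, const struct sketch_result *r)
{
    return sketch_submit_result(buf, cap, r);
}

/* Path 2: a check that a NEB module overrode and ran elsewhere. */
int neb_module_check_finished(char *buf, size_t cap, const struct sketch_result *r)
{
    return sketch_submit_result(buf, cap, r);
}
```

Because both paths funnel through the same function, a result posted
by a module cannot drift out of step with one posted by the core.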

Sorry for the long-winded message. I just felt that these
architectural issues should be addressed. Comments would be very much
appreciated. :)

Regards,

John Calcote
Sr. Software Engineer
LDS Church, ICS Dept.
