Problems with nrpe2 signals and plugin cleanup

Thomas Guyot-Sionnest thomas at zango.com
Tue Feb 26 17:07:43 CET 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Bill Moran wrote:
> In response to Thomas Guyot-Sionnest <dermoth at aei.ca>:
> 
>> On 25/02/08 04:17 PM, Bill Moran wrote:
>>> First, does anyone have a suggestion on how to handle this better
>>> in the script?
>> You should set an alarm and handle it yourself. You could for example
>> have your script timeout by itself after 300 seconds, and NRPE
>> terminating the script after 350 seconds (or more if it may take longer
>> to cleanup). See what Perl plugins do for example...
> 
> This makes no sense to me.  If I'm going to repeat timeout functionality
> in every plugin I write, why would I use Nagios at all?  Might as well
> write everything myself and have each script generate an HTML page as
> output ...

This is how every plugin works. It isn't hard either; here's all you
have to do in Perl, for example:

$SIG{'ALRM'} = sub {
        # Do cleanup here, or call a cleanup function.
        print "UNKNOWN: Plugin timed out\n";
        exit 3;
};

alarm(300);

# Do some stuff here

alarm(0);

That doesn't sound very complicated to me. Every plugin that does
blocking operations should do this. If you don't, please don't come back
here and complain when it breaks; as I said, this is by design.

> That probably sounds ridiculously extreme, but my point is that plugin
> timeout is something that every plugin needs.  It doesn't seem like
> proper design to force every plugin to handle it with its own magic
> when POSIX has a signalling methodology that allows it to be centralized
> in the framework.
> 
> Seems like a lot of redundant code.  It'll get my problem solved for
> the time being, but I find it a hack, not a solution.

Three reasons against it:

1. That means the calling application must track its children and their
timeouts in order to kill them. That can make things more complicated,
especially if you want to allow a grace period for cleanup between the
TERM and the KILL.
2. Some plugins don't need timeouts.
3. Most plugins let you set the timeout, so the job is already done. Why
do it twice? I certainly don't want a single timeout value for all plugins!

Also, implementing timeouts in plugins is very simple. Look above: 7
extra lines that can easily fit around any code. And since you already
need a signal handler in yours, that makes only 2 more lines!
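
To illustrate how little code this takes when the timeout is a parameter
(point 3 above), here is a minimal sketch. The helper name
run_with_timeout and its (exit code, message) return convention are
assumptions made for this example, not part of any Nagios library; the
timeout value would typically come from a command-line option.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Nagios-plugins): run $work under an
# alarm of $timeout seconds and call the optional $cleanup on failure.
# Returns (exit_code, message) using plugin exit codes 0=OK, 3=UNKNOWN.
sub run_with_timeout {
    my ($timeout, $work, $cleanup) = @_;
    my $output;
    my $ok = eval {
        # The handler dies so that the eval block is left immediately.
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        $output = $work->();
        alarm(0);
        1;
    };
    my $err = $@;
    alarm(0);    # make sure no alarm is left pending
    return (0, "OK: $output") if $ok;

    $cleanup->() if $cleanup;
    chomp $err;
    return (3, "UNKNOWN: Plugin timed out after ${timeout}s")
        if $err eq 'timeout';
    return (3, "UNKNOWN: $err");
}

# Example use; a real plugin would then print $message and exit $code.
my ($code, $message) = run_with_timeout(300, sub { "did some stuff" });
print "$message\n";
```

The eval/die form is only one way to write it; the handler shown
earlier, which prints and exits directly, is just as valid for a
plugin.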

>>> Second, I'm curious about the rapid issuance of the TERM/KILL
>>> signals.  Is there anything preventing nrpe2 from simply sleep()ing
>>> a few seconds between the two signals?  I mean, if I'm willing to
>>> wait 300s for success, I'm willing to wait 305s for a clean failure.
>> While I agree it doesn't make much sense to TERM and KILL right after,
>> the only thing I'd do is remove the TERM. Nagios plugins by design must
>> not run indefinitely, so NRPE isn't different. If you sleep between
>> both, then how long should it be? This raises many issues, so it's better
>> to stick with plugins doing their own timeouts.
> 
> How many issues does it raise?  The only one I'm seeing is the "how long
> do you sleep between signals" issue.  If there's something I'm missing,
> feel free to enlighten me.

Plugin migration to the new system, support by 3rd party apps, user
confusion... I'm sure we can find others.

> The "how long do I sleep" issue is minor.  I can think of two happy
> solutions:
> 1) Add another configuration option to both Nagios and NRPE.

I see duplication of both coding and admin work there... only to make a
fraction of plugins (the few custom ones that don't time out as the spec
says) happy... And since you need an alarm handler anyway, and that's
most of the work required, what does it solve? Two alarm() calls in
your code?

> 2) (better) make the timeout some fraction of the overall timeout.
>    How about t/20+1 ... which means a standard 10s timeout results
>    in a 1s wait between term and kill, but a 300s timeout results
>    in a 16s pause, hardly unreasonable for a plugin that's expected
>    to take up to 300 seconds for success.

Yeah, and what about the plugin that will require more? Then someone
will run into that and come here to say this should be
configurable... All that still only to avoid two alarm() calls!
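
For reference, the quoted t/20+1 proposal works out as follows. This is
only a sketch: the function name grace_period is made up for
illustration, and the integer truncation is an assumption, chosen
because it reproduces the 1s and 16s figures in the quoted message.

```perl
use strict;
use warnings;

# Grace period between TERM and KILL under the quoted t/20+1 proposal.
# Assumption: the division is truncated to an integer.
sub grace_period {
    my ($timeout) = @_;
    return int($timeout / 20) + 1;
}

printf "%3ds timeout -> %2ds grace\n", $_, grace_period($_) for (10, 300);
```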

> If there are other issues I'm missing, I'd love to be enlightened, but
> the "what should the timeout be?" issue sounds more like an excuse than
> an actual design challenge.

Well, if you feel that this must be changed, I suggest that you write
patches to Nagios (make sure to implement per-plugin timeouts!),
Nagios-plugins (make all plugins use the parent's signal), NRPE,
nsclient, nsclient++, NC_Net, DNX, etc. to work that way. Then, once
you're done and have managed to get enough testers to show this doesn't
break anything, including special custom configurations, etc., try to
convince most of the community that they need that change and that they
should all spend time changing every component on the next plugins
upgrade and configuring the additional parameters.

Obviously you don't need to change the plugins right away, but then why
have both systems? It will confuse users, and it's very unlikely that
every program running Nagios plugins will implement this new timeout
method; until they all do, there will be no reason for a plugin
writer to implement it, since the plugin will still have to support the
current way (which has no known issues, btw).

> Just my opinions, I suppose.  Thanks for the helpful feedback.

Well, the main point here is that it doesn't fix anything, and
implementing timeouts in plugins isn't that hard.

I think a better "solution" for what you're trying to do is a wrapper
similar to "negate" (Nagios-plugins) that would take timeout and signal
parameters and send the signals when the timeout occurs. Feel free to
add a feature request to the Nagios-plugins tracker for that, but don't
expect anything soon.
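
A rough sketch of what such a wrapper could look like follows. No such
wrapper exists in Nagios-plugins as far as this message says; the
function names, the argument order and the choice to report UNKNOWN (3)
on timeout are all assumptions made for illustration.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Wait up to $seconds for child $pid. Returns its wait status ($?) if it
# was reaped in time, undef otherwise.
sub wait_for_child {
    my ($pid, $seconds) = @_;
    my $status;
    if ($seconds <= 0) {
        $status = $? if waitpid($pid, WNOHANG) == $pid;
        return $status;
    }
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($seconds);
        $status = $? if waitpid($pid, 0) == $pid;
        alarm(0);
        1;
    };
    alarm(0);
    return $status;
}

# Hypothetical wrapper: run @cmd, send $signal after $timeout seconds,
# then send KILL if the child is still alive $grace seconds later.
# Returns (exit_code, timed_out).
sub run_with_signal_timeout {
    my ($timeout, $signal, $grace, @cmd) = @_;
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
        # Child: replace ourselves with the plugin, bypassing the shell.
        exec { $cmd[0] } @cmd or POSIX::_exit(3);
    }

    my $status = wait_for_child($pid, $timeout);
    if (defined $status) {
        # Report UNKNOWN if the child was itself killed by a signal.
        return (($status & 127) ? 3 : $status >> 8, 0);
    }

    kill $signal, $pid;
    $status = wait_for_child($pid, $grace);
    if (!defined $status) {
        kill 'KILL', $pid;
        waitpid($pid, 0);
    }
    return (3, 1);
}
```

For example, run_with_signal_timeout(300, 'TERM', 15, @plugin_command)
would give the plugin 15 seconds to clean up after the TERM before the
KILL is sent.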

- --
Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHxDlP6dZ+Kt5BchYRAs5iAJ4qtOK33D/sJBiPZDn55MXcgfz5pwCg6tmY
nCp3aFBfB7V0Pvo4WQf2WRo=
=uWvg
-----END PGP SIGNATURE-----
