Problems with nrpe2 signals and plugin cleanup

Thomas Guyot-Sionnest thomas at zango.com
Tue Feb 26 18:01:47 CET 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I won't reply to every comment you made because I don't think there's
any need to. The only thing that would make sense IMHO is a wrapper app,
as I suggested in the last paragraph of my previous post.

Bill Moran wrote:
> In response to Thomas Guyot-Sionnest <thomas at zango.com>:
> 
>> Bill Moran wrote:
>>> In response to Thomas Guyot-Sionnest <dermoth at aei.ca>:
>>>
>>>> On 25/02/08 04:17 PM, Bill Moran wrote:
>>>>> First, does anyone have a suggestion on how to handle this better
>>>>> in the script?
>>>> You should set an alarm and handle it yourself. You could for example
>>>> have your script timeout by itself after 300 seconds, and NRPE
>>>> terminating the script after 350 seconds (or more if it may take longer
>>>> to cleanup). See what Perl plugins do for example...
>>> This makes no sense to me.  If I'm going to repeat timeout functionality
>>> in every plugin I write, why would I use Nagios at all?  Might as well
>>> write everything myself and have each script generate an HTML page as
>>> output ...
>> This is how every plugin works. It isn't hard either, here's all you
>> have to do in Perl for example:
> 
> How do I do it in bash?  In C?

In C, you can look at what is being done in Nagios-plugins (anyway, I
guess if you're writing it in C you have enough knowledge to do it
and enough time to write one).

>> Doesn't sound very complicated to me. Every plugin that does blocking
>> operations should do this. If you don't, please don't come back here and
>> complain when it breaks - as I said this is by design.
> 
> Does Nagios officially support non-perl plugins?

Nagios supports every plugin that follows the specs, regardless of the
language used. That's why they're here. And BTW the specs say your
plugin MUST time out by itself promptly.
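
For a shell plugin, timing out by itself could look roughly like the
sketch below. It is only an illustration: it assumes bash 4 or later
(for $BASHPID), and arm_timeout, disarm_timeout, slow_check and
plugin_main are made-up names, not anything from Nagios-plugins. The
work runs in the background so that "wait" can be interrupted by the
alarm signal.

```shell
#!/bin/bash
# Sketch only, assuming bash 4+ (for $BASHPID). Function names, messages
# and timings are illustrative, not taken from any real plugin.
STATE_OK=0
STATE_UNKNOWN=3

# Remove the temp file and stop the background work, if any.
cleanup() {
    rm -f "$TMPFILE"
    kill "$WORK_PID" 2>/dev/null
    return 0
}

# Runs when the timer fires: clean up, report, exit UNKNOWN.
on_timeout() {
    cleanup
    echo "UNKNOWN - plugin timed out after ${TIMEOUT}s"
    exit "$STATE_UNKNOWN"
}

# Arm a timer that sends SIGALRM to the current shell after $1 seconds.
arm_timeout() {
    TIMEOUT=$1
    local self=$BASHPID
    trap on_timeout ALRM
    # Redirect the timer's output so it never holds the plugin's stdout.
    ( sleep "$TIMEOUT" && kill -ALRM "$self" ) >/dev/null 2>&1 &
    TIMER_PID=$!
}

# Stop the timer once the work finished in time.
disarm_timeout() {
    kill "$TIMER_PID" 2>/dev/null
    wait "$TIMER_PID" 2>/dev/null
    trap '' ALRM
    return 0
}

# Placeholder for the real, possibly blocking, check.
slow_check() {
    sleep "$1"
}

# plugin_main TIMEOUT WORK_SECONDS
plugin_main() {
    TMPFILE=$(mktemp) || { echo "UNKNOWN - mktemp failed"; return "$STATE_UNKNOWN"; }
    arm_timeout "$1"
    # Background + wait: a trapped signal interrupts "wait" right away,
    # whereas a foreground command would delay the trap until it ends.
    slow_check "$2" >"$TMPFILE" 2>&1 &
    WORK_PID=$!
    wait "$WORK_PID" || true
    disarm_timeout
    cleanup
    echo "OK - check finished"
    return "$STATE_OK"
}
```

A real plugin would end with something like "plugin_main 10 0; exit $?"
and would have NRPE's own timeout set somewhat higher than the plugin's,
so the plugin always gets to clean up first.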

>>> That probably sounds ridiculously extreme, but my point is that plugin
>>> timeout is something that every plugin needs.  It doesn't seem like
>>> proper design to force every plugin to handle it with its own magic
>>> when POSIX has a signalling methodology that allows it to be centralized
>>> in the framework.
>>>
>>> Seems like a lot of redundant code.  It'll get my problem solved for
>>> the time being, but I find it a hack, not a solution.
>> Three reasons against it:
>>
>> 1. That means the calling application must track children and their
>> timeout to kill them. It can make things more complicated, especially if
>> you want to give a grace time for cleanup between the TERM and KILL.
> 
> Huh?  The code already does this now.

NRPE does; I'm not sure about Nagios. Also, it doesn't capture anything
from the plugins. Some plugins return something useful even after a
timeout: for example, check_icmp will output the amount of packet loss
after a timeout (which always occurs if not all packets came back). Your
method would require a delay between SIGTERM (or anything else) and
SIGKILL, and capturing the output before the KILL.

>> 2. Some plugins don't need timeouts.
> 
> Er ... that contradicts the published plugin design guidelines.

No. If you don't do any blocking call, why set up a timeout?

I know there are more, but look at check_dummy.c and check_cluster.c for
example. No timeout.

>> 3. Most plugins let you set the timeout, so the job is already done. Why
>> do it twice? I certainly don't want a single timeout value for all plugins!
> 
> You're doing it twice already.  As I said, both Nagios and NRPE have a
> timeout/kill mechanism already, all I'm suggesting is that it be made
> more useful.

If they do, it's only to avoid stalled processes. They kill and generate
their own return output. It's not configurable either, especially not
per-plugin.

>> Also implementing timeouts in plugins is very simple. Look above: +7
>> lines that can easily fit around any code. Also since you already need a
>> signal handler on yours, that makes only 2 more lines!
> 
> What I got working in bash is a total of 5 lines, assuming the exit
> cleanup is simple.  I don't consider it a simple process, however,
> and it's easy to do wrong.  Forget the >/dev/null 2>&1 and your
> script will never succeed, for example.
> 
> # Install signal handler:
> trap "rm -f $TMP1 $TMP2; echo 'WARNING - timeout'; exit 2" 1 2 3 6 15 
> 
> # Install the background timer.
> PARENT=$$
> (sleep 290 >/dev/null 2>&1 && kill $PARENT) &
> BGTIMER=$!
> 
> # Do actual plugin work here ...
> 
> kill $BGTIMER
> 
>>> How many issues does it raise?  The only one I'm seeing is the "how long
>>> do you sleep between signals" issue.  If there's something I'm missing,
>>> feel free to enlighten me.
>> Plugin migration to the new system,
> 
> Huh?  Plugins with existing timeout settings will still work.
> 
>> support by 3rd party apps,
> 
> I've no idea what the issues here might or might not be.

Why have both? It's confusing to users. People especially shouldn't
expect that to work everywhere, because not all plugin executors might
work the same way. Since the current one works, why change or add one?

>> user
>> confusion...
> 
> The change makes no difference to users who don't explicitly rely
> on it.  Only plugin writers are affected.  And the only way they're
> affected is to have a new feature they can use to terminate their
> scripts cleanly.

And where do users learn how to write plugins (and eventually become
plugin writers)? Most of the time they will look at another plugin.

>>> The "how long do I sleep" issue is minor.  I can think of two happy
>>> solutions:
>>> 1) Add another configuration option to both Nagios and NRPE.
>> I see both coding and admin work duplication there... Only to make a
>> fraction of plugins (the few custom ones that don't timeout as the specs
>> says)
> 
> Just a bit ago you were claiming that not all plugins need a timeout,
> now you're arguing that those plugins are out of spec.

Not out of spec. The spec says the plugin must return in a timely
manner. Timeouts do it explicitly. Plugins that can't block do it
implicitly.

> I've suggested a feature to make Nagios more useful.  If the politics of
> the project want to mire that down in http://orange.bikeshed.org/
> then I'm not going to stick around to fight, I have other things that
> need done.

This is a corner case (and you even showed that you can work around it),
and I don't believe it's worth changing Nagios, NRPE and other
executors' code to make it work. As I said, at best a wrapper would be
enough and would avoid any code change.
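
To make the wrapper idea a bit more concrete, here is a rough bash
sketch. It is only an assumption of what such a wrapper could look
like; run_with_grace and its messages are hypothetical names. It sends
TERM after a timeout, waits a grace period, then sends KILL, and it
keeps whatever the plugin printed before it died.

```shell
#!/bin/bash
# Sketch only: run_with_grace and its messages are hypothetical.
# Usage: run_with_grace TIMEOUT GRACE COMMAND [ARGS...]
STATE_UNKNOWN=3

run_with_grace() {
    local timeout=$1 grace=$2
    shift 2
    local out pid watchdog rc=0
    out=$(mktemp) || return "$STATE_UNKNOWN"

    # Capture the plugin's output in a file, so that anything it prints
    # while cleaning up survives even if it has to be killed later.
    "$@" >"$out" 2>/dev/null &
    pid=$!

    # Watchdog: TERM after $timeout seconds, KILL after $grace more.
    (
        sleep "$timeout"
        kill -TERM "$pid" 2>/dev/null || exit 0
        sleep "$grace"
        kill -KILL "$pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog=$!

    wait "$pid" || rc=$?
    kill "$watchdog" 2>/dev/null
    wait "$watchdog" 2>/dev/null || true

    # Statuses above 3 are not valid plugin states (e.g. 137 after KILL).
    if [ "$rc" -gt "$STATE_UNKNOWN" ]; then
        [ -s "$out" ] || echo "UNKNOWN - plugin did not exit cleanly" >"$out"
        rc=$STATE_UNKNOWN
    fi

    cat "$out"
    rm -f "$out"
    return "$rc"
}
```

The command definition would then call the wrapper instead of the
plugin, e.g. "run_with_grace 290 10 /path/to/plugin args", with the
executor's own timeout set above the sum of the two values.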

- --
Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHxEX66dZ+Kt5BchYRAlDTAJ4tc7wMJoNw+e1U6QLk9FI/e6+P4ACcCThg
U+6Vhn4LiYSDLk6jfn8HIWE=
=Uyl6
-----END PGP SIGNATURE-----
