Nagios acknowledgement enhancement request

Thomas Guyot-Sionnest dermoth at aei.ca
Thu Nov 13 19:30:31 CET 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jim Winkle wrote:
> On Wed, 12 Nov 2008 at 6:22pm, Thomas Guyot-Sionnest wrote:
>> On 12/11/08 04:45 PM, Jim Winkle wrote:
>>> Hi,
>>>
>>> I have a suggestion for a future enhancement of Nagios.
>>>
>>> In short, I'd like there to be a way to have Nagios send notifications
>>> until we acknowledge a problem -- for certain unique plugins -- without
>>> ignoring future problems. Background and more details follow.
>>>
>>> We're using the check_logfiles plugin to monitor syslogs (e.g.
>>> /var/adm/messages on Solaris). check_logfiles returns CRITICAL when it
>>> detects a problem, but then normally clears itself (returns OK) the next
>>> time it runs.  Nagios notifies us only once under this scenario, and since
>>> it's possible that pagers might miss just one page (paging services aren't
>>> 100% reliable), we'd rather get notified until we explicitly acknowledge
>>> the problem.
>>>
>>> The check_logfiles plugin does have the capability to continue to report the
>>> error (using its "sticky" option). This is good since then we're notified
>>> longer, but if we then use the Nagios "Acknowledge" link to acknowledge the
>>> problem, new problems (e.g. new errors in /var/adm/messages) reported by the
>>> check_logfiles plugin get ignored.
>>>
>>> I asked on the nagios-users list if there was a way to acknowledge a problem
>>> reported by a plugin like check_logfiles without ignoring future problems.
>>> Nobody came up with a way, so I assume this is new functionality needed in
>>> Nagios.
>>>
>>> I realize we can syslog an "okpattern" string and check_logfiles will then
>>> clear, but I'm looking for something using the Nagios web (and external
>>> command_file) interfaces. Using the Nagios "Acknowledge" link would be ideal,
>>> since that's what folks are going to be using to acknowledge other problems.
>>>
>>> I'm using Nagios version 3.0.5 and check_logfiles version 2.4.1.3. We configure
>>> check_logfiles as a volatile service and use state stalking.
>>>
>>> Thanks for providing these great tools! Please let me know if something doesn't
>>> make sense or if I'm missing something.
>> You could most likely achieve what you want with adaptive monitoring.
>> When the service goes to HARD CRITICAL, run an event handler that changes
>> the service command to a dummy critical check. To change it back you
>> could either submit a passive check that triggers the event handler to
>> re-apply the check command, or use a dummy contact whose notification
>> command does it upon receiving an acknowledgement.
>>
>>
>> Some useful links:
>> http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
>> http://nagios.sourceforge.net/docs/3_0/adaptive.html
>> http://www.nagios.org/developerinfo/externalcommands/commandlist.php
> 
> Adaptive monitoring... interesting, dynamic... a little complicated, but
> I'll think about going that route.  Thanks for that response.
> 
> Nonetheless, it would still be cool if the Acknowledge function could
> handle unique plugins like check_logfiles. I can think of two ways this
> could be done:
> 
> 1) Using check_logfiles "sticky" option (the plugin continues to report
> the problem): If Nagios would store the string that the plugin returned 
> when a user clicks "Acknowledge", then if the plugin returns a *new* 
> CRITICAL string, Nagios would go through its notification routine, run event 
> handlers, etc. When the user again clicks "Acknowledge", Nagios stores this 
> new string (discarding the old) to be ready for the next problem. Pretty 
> simple from a user standpoint.

You should rather implement that in your plugin. You can easily pass the
check output and/or performance data back to the next check. I did it
for a Windows CSV perfmon log counter check to monitor incremental
counters (I think I forgot to release it... anyone interested?) and I'd
like to make something similar for check_snmp.
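
As a rough illustration of passing data back to the next check (this is
not the unreleased perfmon plugin, just a sketch): assuming the command
definition hands the previous performance data to the plugin through the
$SERVICEPERFDATA$ macro, the plugin could compute a counter increment
roughly as follows. The function names and the "errors" label are made
up for the example, and quoted labels containing spaces are not handled.

```python
import re


def parse_perfdata(perfdata):
    """Return {label: value} for the 'label=value...' items in a perfdata string."""
    values = {}
    # Units and thresholds after the number (e.g. "c;;;0") are ignored.
    for label, value in re.findall(r"'?([^'=\s]+)'?=(-?\d+(?:\.\d+)?)", perfdata):
        values[label] = float(value)
    return values


def counter_delta(previous_perfdata, label, current_value):
    """Return the increase since the last check, or None if it cannot be computed.

    previous_perfdata is assumed to be what Nagios passed in from the
    previous check result (e.g. via $SERVICEPERFDATA$).
    """
    previous = parse_perfdata(previous_perfdata).get(label)
    if previous is None or current_value < previous:
        # First run, or the counter was reset since the last check.
        return None
    return current_value - previous
```

For example, counter_delta("errors=120c;;;0", "errors", 150) gives 30.0,
and an empty previous string gives None so the plugin can decide what to
report on its first run.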

> 2) Not using check_logfiles "sticky" option (the plugin fires just once): 
> Create a new option like is_volatile called is_transient. Transient services 
> would differ from "normal" services in one important way: when they go into 
> a hard non-OK state, they are locked into this state until the problem is
> acknowledged, even though the plugin returns "OK".
> 
> Ideally, if a new problem occurs Nagios would go through its notification
> routine (run event handlers, etc.) again, so that we're notified about *new*
> problems. So again, the string that the plugin returned would need to be
> stored.

Can't you implement that with event brokers and custom variables? That
would make more sense IMHO... Nagios's biggest strength is its
simplicity and flexibility. There are so many different things you could
do that, if Nagios supported them all out of the box, no one would be
able to understand how to configure it (you can still do it via adaptive
monitoring, right?).

It's also exponentially harder to understand every possible behavior as
you add more parameters, and this makes further modifications and tests
harder to perform. This would not really add a feature since you can do
it already and in the long term it will only hinder further development.

> Thoughts?

So, if you want to add a generic interface that would help you perform a
task (see my other reply to Andrew Ivanov), I don't think it's a bad
idea because you open many new possibilities while leaving the
implementation details to the user.

When you configure a specific task with adaptive monitoring, you have to
consider only the behavior of the specific service you're implementing
it with. That is much easier than understanding every possible
behavior (volatile/active/passive checks, execution/notification
dependencies, escalations, timeperiods, etc.).
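
To make the adaptive monitoring idea a bit more concrete, here is a
minimal sketch of an event handler that swaps the check command when the
service reaches HARD CRITICAL. It assumes the handler is called with
$SERVICESTATE$ and $SERVICESTATETYPE$ as arguments and that it can write
to the external command file; "check_dummy_critical" is a hypothetical
command you would have to define yourself, and the function names are
only for illustration.

```python
import time


def change_check_command(host, service, check_command, now=None):
    """Format a CHANGE_SVC_CHECK_COMMAND external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] CHANGE_SVC_CHECK_COMMAND;%s;%s;%s\n" % (
        timestamp, host, service, check_command)


def submit(command_line, command_file):
    """Write one external command to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


def handle_event(state, state_type, host, service, command_file,
                 dummy_command="check_dummy_critical"):
    """Lock the service onto a dummy critical check once it is HARD CRITICAL.

    Returns True if a command was submitted, False otherwise.
    """
    if state == "CRITICAL" and state_type == "HARD":
        submit(change_check_command(host, service, dummy_command), command_file)
        return True
    return False
```

Restoring the real check would be the same call with the original
command name, triggered from a passive check result or from the dummy
contact's notification command on acknowledgement.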

- --
Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJHHJH6dZ+Kt5BchYRAqthAKCEM7LV10XeXShlSsv6SYAuiGxUDACfa7RN
sUnBSGEbtf8qEqKK/dxHp7A=
=uhni
-----END PGP SIGNATURE-----
