Adding more advanced correlation to nagios with sec (any interest?)

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Fri Jul 11 07:04:16 CEST 2003


Dear Sir,

I am writing to thank you for your well-conceived and well-expressed
letter, and to say,

On Sat, Jun 28, 2003 at 03:48:16PM -0400, John P. Rouillard wrote:

> However, I have some things that I want to do that are not easily
> done within nagios. E.G.
> 
>    If a system jumpstart is in progress, ignore warnings about high
>      interface usage (on one interface), or dropped packets (on the
>      hub).
> 
>    If an index operation of the HTTP server is in progress, ignore
>      warnings about the http interface being slow.
> 
>    I also want to show a host/service down if a given system went down,
>      (as determined by a syslog message) but I want the report done
>      ONLY if the system isn't back up in 5 minutes.
> 
> Note that none of the rebooting, indexing, or jumpstarting operations
> occur at fixed times, so I can't schedule these in advance.
>

that this, as you say, demonstrates the case for Nagios being able to 
provide better event correlation than it does now. 

However, would you please spell out which events, and from which
sources, are correlated by Sec to avoid spurious alarms in these cases,
especially the first two? Is Sec correlating plugin failures with syslog
messages?
 
> Other things can sort of be done in nagios, but it is a bit tough to
> configure. E.G. I have a single snmp_trap service defined for my
> hosts. The service is considered volatile, and is state_stalked.  I
> want to do the following:
> 
>    If an (particular range of) interfaces on a switch goes down (and
>      sends a trap) ignore it unless it has gone down/up 3 times in
>      five minutes. Don't clear it until it has stayed up for at least
>      15 minutes.
> 
>    Other interfaces on the same switch should be reported immediately
> 
> I could do part of this by adding every one of my 20 interfaces on the
> switch as services, but that doesn't really handle the timing aspects.
> It makes the services a lot more difficult to read and configure.
> 
> Another thing I want to do is:
> 
>    Synthesize an event that notes if two of my three links to
>      a remote site are having problems. That is two of my three
>      routers may be in a warn state, and I want to place the
>      "Access to 16 net" service in a critical state.
> 
> This can be done by event handlers, but you end up writing a portion
> of sec to do it, so you might just as well use sec in the first place.
> 

Agreed.
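
As an illustration of the "two of my three links" case quoted above, the
correlation itself is only a few lines. The following is a minimal
Python sketch, not Sec configuration; the router names and the function
name are made up for the example, and it assumes the usual plugin return
codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def synthesize_link_state(router_states, threshold=2):
    """Return CRITICAL when at least `threshold` routers are not OK.

    `router_states` maps a (hypothetical) router name to its last
    known return code.
    """
    troubled = sum(1 for state in router_states.values() if state != OK)
    return CRITICAL if troubled >= threshold else OK


# Two of the three routers are in a warn state, so the synthetic
# "Access to 16 net" service would be reported as CRITICAL.
states = {"router-a": WARNING, "router-b": OK, "router-c": WARNING}
access_state = synthesize_link_state(states)
```

The hard part, as the quoted text says, is not this test but feeding it
current state and timing, which is what Sec already provides.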

> I have a method of integrating sec <http://www.estpak.ee/~risto/sec/>
> into nagios to handle these issues and more.
> 
> Using sec to process traps (or other passive checks) is straight
> forward. The trap collector running from snmptrapd just dumps the trap
> report (formatted as a nagios passive service check) into sec's input
> fifo and then sec processes it, and reports it (if needed) into the
> nagios.cmd pipe.
> 

And a very attractive means of handling SNMP traps it is too.
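
For reference, a passive service check is a single ';'-delimited line
written to the Nagios external command file. A minimal Python sketch of
formatting one follows. The host and service names are made up, and
collapsing newlines in the output is an assumption of this sketch, not
something taken from the trap collector described above.

```python
import time


def passive_service_check(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # The command must fit on one line, so fold any newlines in the
    # plugin output into spaces (an assumption of this sketch).
    one_line = " ".join(output.splitlines())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, one_line)


line = passive_service_check("switch1", "snmp_trap", 2,
                             "linkDown on port 7", now=1057900000)
```

Sec (or any other program) would then append `line` to the nagios.cmd
pipe, whose location depends on the local installation.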

Sec has become for me, the standard way of providing event and trap 
handlers.

For example, I have a general host and service handler that updates a 
MySQL DB with the outage interval. To do this it must retain state (and 
does so with a Perl hash tied to a DB file) so it can determine if there 
has been a transition and if so, how long it was.

This would probably be easier to do with Sec contexts.
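
The state such a handler has to retain is small: the time at which each
host or service left the OK state. A minimal Python sketch of that
bookkeeping is below. It is an illustration only, not the Perl handler
mentioned above; the function name and the "host/service" key format
are assumptions.

```python
OK = 0  # Nagios return code for a healthy service


def record_state(store, key, state, now):
    """Track outages in `store`; return the outage length on recovery.

    `store` is any dict-like mapping from a service key to the time its
    current outage began. Returns the outage duration in seconds when
    `state` is a recovery to OK, otherwise None.
    """
    started = store.get(key)
    if state != OK:
        if started is None:
            store[key] = now      # OK -> problem: the outage starts here
        return None               # still down, or just went down
    if started is not None:
        del store[key]            # problem -> OK: the outage is over
        return now - started
    return None                   # OK -> OK: nothing to report


store = {}
record_state(store, "web1/HTTP", 2, 100)              # goes CRITICAL
outage = record_state(store, "web1/HTTP", 0, 400)     # recovers
```

Passing a `shelve.open(...)` object instead of a plain dict would keep
the state across handler invocations, much as a tied hash does in Perl.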

> However for polled items, it more difficult. I don't want to have a
> flapping service where the plugin determines that there is a problem,
> nagios reacts to that, and then sec reacts to that (being fed its info
> by an event handler) by clearing the service because sec determines
> that there is not yet a problem. This leads to a flapping service as
> nagios and sec disagree on what is a true problem, and leads to
> spurious notifications because I can't put in a high
> max_check_attempts and have nagios respond to sec when it has a real
> problem (unless I define yet another service yech).
> 
> What I did was write a plugin in perl (sec_filter) that runs the
> nagios command (sort of like check_ssh). It always passes the output
> of the plugin to sec's input pipe.  However, depending on the flags
> given to the sec_filter script, it will exit:
> 
>     with an "ignore OK" code, and no output
>     with an "ignore ERROR" code, and no output
>     with the exit code and output of the plugin
> 
> I have chosen exit status of 5 for "ignore OK" and 6 for "ignore
> ERROR". (It looks like code 4 is used internally for pending states,
> and I didn't want to use that number hence my choice of 5 and 6.)
> 
> The reason for these new codes is to make nagios not change any status
> for the polled service based on the poll. The new status will be sent
> to it by a passive check command generated from sec.
> 


> That is I want nagios to be a (almost) dumb poller and to let sec
> filter all the data. 

If I understand correctly, the proposal is

1 When Nag schedules any service check, it in fact execs sec_filter
with the real plugin name and flags that determine sec_filter's
behaviour, allowing it to either

 1.1 treat the service as a normal Nag service (a 'polled' service, for 
     which no event correlation by Sec is necessary)

 1.2 treat the service as requiring Sec processing to accurately 
     determine the service state. Sec will get the plugin output and
     use this with other Sec inputs and Sec context to determine the 
     service state

2 Sec_filter writes

 2.1 For those services requiring Sec,

   2.1.1 An event to Sec

   2.1.2 One of the new status codes to Nagios

 2.2 Otherwise, in the case of 'polled' services, the usual Nag status 
     codes and plugin output are written to Nags input queue

3 Nag processes the existing status codes with no changes (i.e. CRITICAL
leads to the check being repeated at the retry_interval and, if the
state persists, to Notification), but those with the new code of
IGNORE_ERROR are recognised as requiring a retry at the retry_interval
but _no_ other processing.

4 Sec will eventually submit a PROCESS_SERVICE_CHECK_RESULT to the Nag
input queue (for the services that have formerly been reported as
IGNORE_\w+).

Is this correct?
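
To make the steps above concrete, here is a rough Python sketch of a
wrapper along those lines. It is not the sec_filter plugin itself (which
is written in Perl and has not been posted). The exit codes 5 and 6 come
from the quoted message; the `mode` names, the event line format sent to
Sec, and the rule mapping a zero exit to 5 and anything else to 6 are
all assumptions.

```python
import subprocess

IGNORE_OK = 5       # "ignore OK" code chosen in the quoted message
IGNORE_ERROR = 6    # "ignore ERROR" code chosen in the quoted message


def run_filtered(plugin_argv, sec_input, mode="sec"):
    """Run the real plugin and return (exit_code, output) for Nagios.

    `sec_input` is a writable text stream standing in for Sec's input
    fifo. `mode` is a hypothetical flag: "passthrough" returns the
    plugin result unchanged, anything else hides it behind an ignore
    code so that only Sec decides the service state.
    """
    proc = subprocess.run(plugin_argv, capture_output=True, text=True)
    output = proc.stdout.strip()

    # Sec always sees the real result (event format is an assumption).
    sec_input.write("%d %s\n" % (proc.returncode, output))
    sec_input.flush()

    if mode == "passthrough":
        return proc.returncode, output
    # No output: Nagios should not change the service state itself.
    if proc.returncode == 0:
        return IGNORE_OK, ""
    return IGNORE_ERROR, ""
```

A real wrapper would print the returned output and exit with the
returned code, so that an unpatched Nagios still works for services run
in passthrough mode.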

My remarks are

1 This _may_ be better done in the Nag core. Nag could be equipped with
configuration directives for Sec processing so that Nag itself could
submit the event to Sec (rather than the plugin sec_filter). This 
saves an extra fork.

2 I am not sure how your proposal relates to the embedded Perl stuff 
(where each plugin is called as a function from the Nagios address 
space).

This is probably trivial, since sec_filter simply becomes another Perl
plugin that Nag calls (and sec_filter 'requires' the real Perl plugin so
that re-compilation of the real plugin is avoided).

3 I like the bit about making Sec processing optional (depending on the
options specified to sec_filter).


> Using sec provides much better control over flap
> detection, and multiple service correlation. Above I said I wanted
> nagios to be an almost dumb poller. This is because I want nagios to
> poll at the retry_check_interval if there is a problem found by the
> plugin. If sec_filter exits with status 6, then nagios polls at the
> faster retry interval. This allows sec to better determine the trouble
> the system is in, or more easily determine when the system recovers.
>


For my part, I am quite happy with Nag's processing of most services,
and the scenarios you mention are not problematic for me. However, I
would very much like the option of event correlation when required.
 
> I have set it up so that sec itself is a passive nagios service, and
> automatically sends notifications to nagios, as well as nagios being
> able to poll the sec service if its data gets stale.


> 
> So is anybody interested in my mods (about 30 lines) to nagios to
> support this, and my plugin?

This needs the comment of the Nagios developer. It sounds attractive to 
me however.

I am sorry if these remarks are stupid or based on misunderstanding. I
think I would need to see the mods to give a (marginally) better
response.

It may simply be worth posting them to Nagios-Devel. AFAIK this is not 
on the Nag road map so it simply may be a golden opportunity for a big 
benefit.

Finally, you have identified a good area for future development. Root
cause analysis and event correlation are one area in which commercial
products can claim superiority.

Thank you very much.

> 
> Note, there is a issue with sec in that ;'s can't be embedded in its
> action commands. This is a problem since nagios' passive commands are ;
> delimited. There should be a new version of sec out (2.1.8) once
> testing is complete that addresses this issue.
>

As you say, this has been dealt with to my satisfaction in 2.1.8.
 
> 				-- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.
> 

Yours sincerely.

-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.

