Unclear on mapping of passive checks to state changes Was: dry alarm contact monitor.

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Sun Feb 8 22:18:10 CET 2004

Dear Sir,

I am writing to thank you for your letter and say,

On Sat, Feb 07, 2004 at 08:26:01PM -0800, nagios-users-request at lists.sourceforge.net wrote:

> From: Skip Montanaro <skip at pobox.com>
> Date: Sat, 7 Feb 2004 06:32:50 -0600
> To: nagios-users at lists.sourceforge.net
> Reply-To: skip at pobox.com
> Subject: [Nagios-users] Unclear on mapping of passive checks to state changes
> 
> 
> I'm more than a bit unclear on how passive checks of stuff like logfiles are
> supposed to work.

First, this Nagios setup uses Sec-mediated passive checks of the _log_ of
SNMP traps (logged by Net-SNMP/snmptrapd).

IMHO this is a superior method to using trap handlers because it is more
scalable (not requiring a new handler per trap and so on) and more
powerful (able to process events based on rules).

Monitoring log files works like this:

1 A daemon (Sec) _or_ code like check_log 'tails' the log file (reads the
records from where it last finished reading to the end of file) and
recognises any records of interest.

2 This code (the daemon, the custom check, or check_log) determines
through some internal logic, such as pattern matching, whether there has
been a state change, whether this is a duplicate record, and whether any
state change is significant.

This identification of the significance of the log events, based on
context and time intervals, is the critical function, and is why a
rules-based approach is so useful.

With Sec, for example, one can ignore pairs of events that occur within
a time window, so that down/up pairs are filtered out provided they
happen quickly. Otherwise an unmatched down leads to

3 This code generating a well-formed Nagios event and injecting it into
the Nagios command file

(see the docs for examples of how to do this). The code must

3.1 associate this record with a host [either from a field in the record
    or a priori - all events of this class belong to this host]
3.2 associate this record with a service [as above]
3.3 format the Nagios event (the text in parentheses in 3.4)
3.4 write it to the pipe, eg

    write /usr/local/nagios/var/rw/nagios.cmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%h;%s;0;%o);

3.5 change any necessary context
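
For anyone not using Sec, steps 1 and 3 above can be sketched in a few
lines of Python. This is only a rough illustration, not code from this
installation or from the Nagios distribution: the function names, the
way the read offset is kept, and the choice of Python are all
assumptions. The command file path is the one used above and may differ
on your system.

```python
import os
import time


def read_new_records(log_path, offset):
    """Return (complete_lines, new_offset) for data appended since offset.

    Hypothetical helper: the caller is expected to store new_offset
    somewhere and pass it back in on the next run.
    """
    with open(log_path, "rb") as log:
        log.seek(0, os.SEEK_END)
        if log.tell() < offset:
            # The file is shorter than last time (rotated or truncated),
            # so start again from the beginning.
            offset = 0
        log.seek(offset)
        data = log.read()
    # Consume only complete lines; a partly written last line is left
    # in place and picked up on the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def format_check_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the command file."""
    timestamp = int(time.time() if now is None else now)
    # Each command is a single newline-terminated line, so the output
    # text must not contain newlines of its own.
    output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit(command_line,
           command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

A caller would loop over the lines returned by read_new_records(),
apply whatever matching logic it likes (step 2), and pass the result of
format_check_result() to submit().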

> If I monitor a condition actively (let's say, whether or
> not my web server is responding), that condition remains present and the
> state remains at critical or warning until the problem causing it is
> resolved (server restarted, network problem resolved, etc).
>

> If I'm monitoring a logfile and see
> 
>     2004-02-07 door opened by skip without auth code
> 
> a passive check might report that condition.  How is that turned into a
> warning or critical condition?

In code:

# $host is either this host or the host to which these events will be
# reported as belonging. The problem of associating hosts with
# records is up to you - either log the address or use some other
# means. NB You _also_ must convert addresses to Nag host names.

# $service is likewise a policy matter, eg name this one
# 'Security door alarm'.

# NAG_PIPE is assumed to be already open for writing on the Nagios
# command file.

if ( /door opened by (\w+) without auth code/ ) {
  # Return code 2 is CRITICAL. The command must be a single line
  # ending in a newline, so the output text is not wrapped.
  syswrite NAG_PIPE,
    '[' . time() . ']'
    . " PROCESS_SERVICE_CHECK_RESULT;$host;$service;2;"
    . "Door opened by $1 without auth.\n";
}

> That's presumably only going to occur one
> time though.  Since it's a report of a one-time event, not a long-term state
> change to the system, how is the system supposed to know when to change the
> state back to ok?
>

It doesn't. Passive service check results latch the service state,
although there are Nag options (volatility for example) to modify this.

Your monitoring either resets the service - in this case, the rules
would have to allow for an interval to elapse before issuing an OK - or
you do so (submit passive service check result).

Here's an example:

A certain famous brand print spooler, when it finds that a printer goes
off line, starts scanning its subnet for the printer with snmp (in case
the printer may have changed its address).

This causes network devices to issue SNMP Auth fail traps.

This installation detects those traps and sends a critical to the
service 'Authentication traps' on the print spooler host.

We reset this manually (after ensuring nothing bad is happening) but we
will change this to reset automatically after an interval of no traps.
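
The automatic reset could look something like the Python sketch below.
It is not what this installation runs; the class name, method names and
the idea of a periodic 'tick' are assumptions made for illustration.
Each matching event produces a critical result, and an OK is produced
only once a quiet interval has passed with no further events.

```python
OK, CRITICAL = 0, 2


class LatchedService:
    """One passive service that latches CRITICAL on an event and is
    reset to OK after quiet_seconds without further events."""

    def __init__(self, host, service, quiet_seconds):
        self.host = host
        self.service = service
        self.quiet_seconds = quiet_seconds
        self.last_event = None  # time of latest event; None while OK

    def _command(self, now, code, output):
        # Same external command format as used for any passive result.
        return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
            int(now), self.host, self.service, code, output)

    def on_event(self, now, output):
        """Call for each matching record; returns a CRITICAL command."""
        self.last_event = now
        return self._command(now, CRITICAL, output)

    def on_tick(self, now):
        """Call periodically; returns an OK command once the quiet
        interval has elapsed since the last event, otherwise None."""
        if self.last_event is None:
            return None
        if now - self.last_event < self.quiet_seconds:
            return None
        self.last_event = None
        return self._command(
            now, OK, "No events for %d seconds" % self.quiet_seconds)
```

The returned strings still have to be written to the command file; the
class only decides when a critical or an OK is due.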

Have a look at Sec if you need some encouragement about the power of
rule based processing. 
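
To give a flavour of what such a rule does, here is the down/up pair
filtering mentioned earlier written out by hand in Python. This is not
Sec syntax and not how Sec is implemented; the class and method names
are made up for the example. A 'down' is held back for a time window
and silently dropped if the matching 'up' arrives in time; only a
'down' that is still unmatched when the window closes gets reported.

```python
class PairFilter:
    """Suppress down/up pairs that complete within `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.pending = {}  # key (eg interface name) -> time of 'down'

    def on_down(self, key, now):
        """Remember a 'down'; nothing is reported yet."""
        self.pending.setdefault(key, now)

    def on_up(self, key, now):
        """Return True if this 'up' cancelled a pending 'down' in time."""
        seen = self.pending.get(key)
        if seen is not None and now - seen <= self.window:
            del self.pending[key]
            return True
        return False

    def expired(self, now):
        """Return (and forget) keys whose 'down' was never matched
        within the window; these are the ones worth reporting."""
        due = sorted(k for k, t in self.pending.items()
                     if now - t > self.window)
        for key in due:
            del self.pending[key]
        return due
```

With a rule engine the same behaviour is a few lines of configuration,
which is the point of the recommendation above.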

> Thx,
> 
> -- 
> Skip Montanaro
> Got gigs? http://www.musi-cal.com/submit.html
> Got spam? http://spambayes.sf.net/
> skip at pobox.com

Yours sincerely.

-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.
