nagios as message log server

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Sun Feb 22 10:08:27 CET 2004


Dear Sir,

I am writing to thank you for your letter and say,

On Sat, Feb 21, 2004 at 08:39:09PM -0800, nagios-users-request at lists.sourceforge.net wrote:

> Message: 2
> From: "Neil" <neil-on-nagios at restricted.dyndns.org>
> Subject: [Nagios-users] Re: nagios as message log server
> 

  .. preamble snipped

> 
> 
> It's nice to have all the system/critical events from all over the
> enterprise sent to a central logging system,

yeah hup ! 

> in this case, nagios. But,
> what I am worried about now is what happens if we aren't
> actually monitoring a service,
> but just waiting for a critical message in /var/log/messages or a critical
> event sent by Snare for Windows.

  .. snip: synopsis of the problem: an event log entry raises the CRITICAL
           status of the corresponding Nagios service, but how does that
           status get set back to OK?

> 
> Since this isn't a service, I can't find a solution on how I can restore 
> back the state to OK.

Either issue a 'submit passive check result' command from the service's
information page in the Nagios web interface,

or,

employ an event correlator (i.e. software that understands the
significance and relationships of messages _in_ time), such as Sec (the
Simple Event Correlator), to unlatch the CRITICAL state after a
sufficient interval following the CRITICAL message. Whether that is the
appropriate processing depends on the service: if it represents an IDS
alarm, for example, you may not want to do this. In any case, you are
probably better off having Nagios treat the service as "volatile"
(is_volatile 1 in the service definition).
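Either way, what reaches Nagios is a passive check result, which is one line written to the Nagios external command file. A minimal Python sketch, assuming the command file path /usr/local/nagios/var/rw/nagios.cmd used in the Sec rule, the host tsitc and the service name 'Authentication traps' (the rule's desc=); `format_passive_result` and `submit` are hypothetical helpers, not part of Nagios or Sec.

```python
import time
from typing import Optional

# Nagios return codes for passive service check results.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host: str, service: str, code: int,
                          output: str, timestamp: Optional[int] = None) -> str:
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError(f"invalid return code: {code}")
    if timestamp is None:
        timestamp = int(time.time())
    # An external command must fit on a single line.
    one_line = output.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{one_line}\n")


def submit(command_file: str, line: str) -> None:
    """Write the command to the Nagios command FIFO (path is an assumption)."""
    with open(command_file, "a") as fifo:
        fifo.write(line)
```

For example, `submit("/usr/local/nagios/var/rw/nagios.cmd", format_passive_result("tsitc", "Authentication traps", OK, "Reset by hand"))` would set the service back to OK, provided Nagios has external commands and passive checks enabled for it.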

There is a significant difference between software like Swatch, which
is mainly intended to react to individual patterns in log files, and
Sec, which understands that a pattern may mark the start of __event
processing__.

Examples of events (rather than patterns representing messages) are:

. the log message rate rises above a threshold (eg SU failed on ...)

. the log message rate falls below another threshold

. a log message is followed by another one within no more than 't'
seconds of the first (the messages can be completely different)

. all the log messages matching a pattern within an interval

None of these can be processed without referring to the time intervals
between the messages.
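To make the point concrete, here is a minimal Python sketch of detectors for these events. It is not how Sec implements them; `RateWindow`, `rate_above`, `rate_below` and `PairWithin` are hypothetical names, and the sketch assumes every message arrives with a non-decreasing timestamp in seconds. Each detector keeps state about earlier messages, which is exactly what a tool that looks at one line at a time lacks.

```python
from collections import deque
from typing import Deque, Optional


class RateWindow:
    """Keep the timestamps of matching messages seen in the last `window` seconds."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.times: Deque[float] = deque()

    def _expire(self, now: float) -> None:
        # Drop timestamps that have slid out of the interval (now - window, now].
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()

    def add(self, t: float) -> int:
        """Record a matching message at time t; return the count in the window."""
        self.times.append(t)
        self._expire(t)
        return len(self.times)

    def count(self, now: float) -> int:
        """Return how many matching messages fall in the window ending at `now`."""
        self._expire(now)
        return len(self.times)


def rate_above(w: RateWindow, t: float, threshold: int) -> bool:
    """Event: a new message pushes the windowed count above `threshold`."""
    return w.add(t) > threshold


def rate_below(w: RateWindow, now: float, threshold: int) -> bool:
    """Event: at time `now` the windowed count has fallen below `threshold`."""
    return w.count(now) < threshold


class PairWithin:
    """Event: a message of kind B follows a message of kind A within `t` seconds."""

    def __init__(self, t: float) -> None:
        self.t = t
        self.last_a: Optional[float] = None

    def see_a(self, when: float) -> None:
        self.last_a = when

    def see_b(self, when: float) -> bool:
        if self.last_a is not None and 0 <= when - self.last_a <= self.t:
            self.last_a = None  # each A pairs with at most one B
            return True
        return False
```

The last example in the list, all matching messages in an interval, is simply the contents of `RateWindow.times` at a given moment.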

<off topic>

Here is a complete worked example of resetting a service with Sec.

Here is the Sec rule that processes the events. The rationale follows.

type=SingleWithSuppress
ptype=RegExp
pattern=Authentication Failure Trap .+?IpAddress: (\S+)
desc=Authentication traps
action=assign %a $1; \
   eval %n ( $a = '%a'; %%hn = ('10.a.b.c1' => 'foo', \
                                '10.a.b.c2' => 'bar', \
                                '10.a.b.c3' => 'blech', \
                                '10.a.b.c4' => 'baz', \
                               ); \
             exists $hn{$a} ? "$hn{$a}/$a" : "Unknown hostname/$a" ;); \
   eval %o ('Trap from %n. Print spooler may be scanning all addresses with Snmp to discover an offline printer.'); \
   assign %h tsitc; \
   delete auth_traps_seen; \
   create auth_traps_seen 960 ( assign %o No auth traps caused by %n for last 16 minutes.; \
                                write /usr/local/nagios/var/rw/nagios.cmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%h;%s;0;%o) \
                              ); \
   write /usr/local/nagios/var/rw/nagios.cmd ([%u] PROCESS_SERVICE_CHECK_RESULT;%h;%s;2;%o)
window=900

The intent is to process a stream of 'Authentication Failure Trap'
messages by:

1 At the start of the stream, inform the Nagios service monitor (by
writing a formatted Nagios 'passive service check result' to the Nagios
command input FIFO) that the Nagios service corresponding to these traps
is 'critical' (the ';2;' in the last Sec write command).

2 While the stream continues, repeat the CRITICAL result at most once
every 15 minutes after the last notification.

3 If no Auth Trap message re-triggers the rule within 16 minutes of the
last notification, inform Nagios that the Nagios service corresponding
to the traps is OK (the ';0;' in the first write command).

(The rule uses a Perl mini-program to map the IP address that caused
the Auth Trap to a few a priori known culprits: foo, bar, blech and baz.)
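For readers who do not read Perl, the same lookup can be sketched in Python. This is an illustration only, not part of the Sec rule: `describe_source` and `source_of` are hypothetical names, the regular expression is copied from the rule's pattern= line, and the 10.a.b.cN addresses are the rule's own placeholders, kept as-is.

```python
import re
from typing import Optional

# Same regular expression as the rule's pattern= line.
TRAP_RE = re.compile(r"Authentication Failure Trap .+?IpAddress: (\S+)")

# Placeholder addresses from the rule, kept exactly as written there.
KNOWN_CULPRITS = {
    "10.a.b.c1": "foo",
    "10.a.b.c2": "bar",
    "10.a.b.c3": "blech",
    "10.a.b.c4": "baz",
}


def describe_source(ip: str) -> str:
    """Return 'name/ip' for a known culprit, else 'Unknown hostname/ip' (like %n)."""
    name = KNOWN_CULPRITS.get(ip)
    return f"{name}/{ip}" if name is not None else f"Unknown hostname/{ip}"


def source_of(line: str) -> Optional[str]:
    """Apply the pattern to a log line and describe the sender, or return None."""
    m = TRAP_RE.search(line)
    return describe_source(m.group(1)) if m else None
```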

The rationale is:

SingleWithSuppress compresses the stream of trap messages into at most
one write every 15 minutes (window=900).

After the last Sec action (write):
1 if there is another trap within 15 minutes   => suppressed: no write
2 if there is a trap between 15 and 16 minutes => rule matches: write
                                                  CRITICAL and restart
                                                  the 16 minute context
3 if no trap arrives by 16 minutes             => context auth_traps_seen
                                                  expires: write OK.
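The table can be checked with a small timeline simulation. This is a minimal Python model of the rule's behaviour under stated assumptions, not Sec itself: `simulate` is a hypothetical helper, trap times are in seconds, a trap exactly 900 s after the last write counts as outside the window, and a trap exactly at the 960 s expiry sees the OK written first.

```python
from typing import List, Optional, Tuple

WINDOW = 900    # window=900 in the rule (SingleWithSuppress)
LIFETIME = 960  # lifetime of the auth_traps_seen context
CRITICAL, OK = 2, 0


def simulate(trap_times: List[float], end: float) -> List[Tuple[float, int]]:
    """Return the (time, return_code) results the rule would write up to `end`."""
    writes: List[Tuple[float, int]] = []
    last_write: Optional[float] = None  # start of the current suppression window
    expiry: Optional[float] = None      # when auth_traps_seen expires, if it exists
    for t in sorted(trap_times):
        # If the context ran out before this trap, its action list wrote OK.
        if expiry is not None and t >= expiry:
            writes.append((expiry, OK))
            expiry = None
        # Inside the window SingleWithSuppress suppresses the action.
        if last_write is not None and t - last_write < WINDOW:
            continue
        # Otherwise the action runs: write CRITICAL, delete and recreate the context.
        writes.append((t, CRITICAL))
        last_write = t
        expiry = t + LIFETIME
    if expiry is not None and expiry <= end:
        writes.append((expiry, OK))
    return writes
```

For example, `simulate([0, 800], 2000)` gives `[(0, 2), (960, 0)]`: under this model a trap suppressed inside the 15 minute window does not postpone the OK, so only a trap arriving between 15 and 16 minutes keeps the service CRITICAL.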

</off topic>

Yours sincerely.




-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.

