How to handle variable periods of relevance for passively monitored services?

Anthony anthony-nagios at hogan.id.au
Sun Sep 6 07:07:50 CEST 2009


Hi all,

At work we have a Nagios setup with about 40 hosts and 150 services.

Of those 40 hosts, about 15 are workstations which operators use depending
on their shift. While an operator is working on a given machine performing
particular tasks, specific software needs to be running (third-party client
software etc. that feeds data into the systems we use). I monitor things
like how many instances are running, whether the particular piece of
software is generating the expected output, whether expected services are
running, whether there's enough free disk space, CPU utilisation, and so on.

If an operator accidentally starts multiple copies of some of the software,
or a phantom copy is running in the background (occasionally GUIs crash,
leaving background processes running and causing all sorts of gremlins),
it's handy to know that they're running outside of normal bounds, and it
allows me to help diagnose any problems. The same goes if they're about to
run out of disk space due to some rogue logging process.

On the days when a given operator is not working, their particular system
may be switched off or, if it's on, certain services may not need to be
running.

To overcome firewall issues (the systems are spread across several states)
they all tend to push passive test results back to the central Nagios
server.

This means that, on any one day, it's likely that a particular host is
either switched off or not running all the services it would on an active
day, because its operator is not rostered on that day. I then get a sea of
red in Nagios, which leads to Chernobyl issues (the important alarms not
standing out above the ones that are "ok to be critical").

Now, service check time periods only apply to active service checks, not
passive service checks.

How does one get around this situation of variable periods of relevance for
passively monitored services?

My thought was that perhaps I need to create an additional web interface
for operators to say when they are using a particular machine and what for.
Behind the scenes this would send the relevant external commands to Nagios
to do things like setting an OK state and disabling further passive checks
across the host, or doing this for individual services. But I wondered if
there was a cleaner way to do this?
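
To make the idea concrete, here is a rough, untested sketch in Python of what
the back end of such an interface might write to the Nagios external command
file. The command names (PROCESS_SERVICE_CHECK_RESULT,
DISABLE_PASSIVE_SVC_CHECKS, ENABLE_PASSIVE_SVC_CHECKS) are standard external
commands; the command file path, the host and service names, and the helper
function names are assumptions for illustration only. I have not verified
whether the OK result is always applied before the disable takes effect, so
the ordering and timing would need testing.

```python
import time


def format_external_command(command, *args, now=None):
    """Build one line in the external command format: [time] CMD;arg;arg"""
    timestamp = int(time.time() if now is None else now)
    return "[{}] {}\n".format(timestamp, ";".join([command, *map(str, args)]))


def off_shift_commands(host, services, now=None):
    """For each service: submit a passive OK, then stop accepting results."""
    lines = []
    for svc in services:
        lines.append(format_external_command(
            "PROCESS_SERVICE_CHECK_RESULT", host, svc, 0,
            "OK - operator not rostered", now=now))
        lines.append(format_external_command(
            "DISABLE_PASSIVE_SVC_CHECKS", host, svc, now=now))
    return lines


def on_shift_commands(host, services, now=None):
    """Re-enable passive results when the operator starts a shift."""
    return [format_external_command("ENABLE_PASSIVE_SVC_CHECKS", host, svc,
                                    now=now)
            for svc in services]


def send(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write commands to the command pipe (path is an assumed default)."""
    with open(command_file, "w") as fh:
        fh.writelines(lines)
```

For example, send(off_shift_commands("ws01", ["Feed Client", "Disk Space"]))
would be called when an operator signs off a (hypothetical) host ws01, and
send(on_shift_commands(...)) when they sign back on.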

Alternatively, perhaps I could create a user-controlled service which
indicates whether the operator is active and then, depending on the state
of this service, not care about the state of its "dependent services".
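
The behaviour I am after could be sketched like this in Python. This is only
the logic, not Nagios configuration, and it does not use Nagios's own
dependency objects; the gate service name "Operator Active" and the function
name are made up for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def relevant_problems(states, gate_service="Operator Active"):
    """Return the non-OK services that should actually be shown.

    states maps service description -> Nagios state code for one host.
    If the gate service is not OK (operator off shift), nothing on the
    host is considered a problem. A missing gate is treated as OK so
    that problems are shown rather than silently hidden.
    """
    if states.get(gate_service, OK) != OK:
        return {}
    return {name: state for name, state in states.items()
            if name != gate_service and state != OK}
```

With the gate CRITICAL (operator not rostered), a CRITICAL "Feed Client" on
the same host would be ignored; with the gate OK it would be reported.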

I know Nagios is generally geared towards monitoring the traditional concept
of a server and service: always on 24x7, or otherwise at fixed, inflexible
intervals. Unfortunately the environment I work in is presently a lot more
dynamic than that.