Issues with 1.0b5

Brian Wilson wilson at unity.ncsu.edu
Thu Aug 22 16:22:16 CEST 2002


First off, I've been a longtime user of netsaint, and my current network
status system is based on a combination of netsaint and hp openview.
After learning about the nagios/netsaint spinoff, I decided to look into
it because of some critical features that were missing from netsaint.

The first is that nagios retains host downtime across program restarts.
This is critical, and I got around it in netsaint with a series of at
jobs.  I also like the config file changes.  I have scripts that do all
configuration file generation, and this change was much needed.  (There
is, however, a limit on how many devices you can add to a group.  Try
adding 2,000 hosts to a group and see what happens.  Group members
should really be listed one per line instead of comma separated:
  member switch1
  member switch2
  member switch3
  etc.
)

I've been convinced to do a rewrite of our current system and decided to
try nagios instead of netsaint, mainly because of downtime retention.  To
test downtime retention, I set up a test device with a dummy host-alive
check and 2 service checks, one being a ping:

define service{
        use             generic-service;
        host_name       cisco-temp;
        is_volatile     0;
        check_period    24x7;
        contact_groups  dummy-group;
        notification_period     24x7;
        notification_interval   180;
        notification_options    c,r;
        service_description     PING;
        max_check_attempts      5;
        normal_check_interval   8;
        retry_check_interval    2;
        event_handler   switch-down-event-handler;
        check_command   check_ping;
        }

I set up a dummy-group as the contact group because I don't want
notifications of a host/service down being sent directly to an email
address.  Imagine if 100 hosts suddenly went down: 100 emails.  Instead,
I always call an event handler to perform some action (i.e. allow host
notifications to queue up and send them in bulk every 15 minutes or so).
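
As a rough illustration of the queue-and-batch idea, here is a minimal
Python sketch.  The queue file path and the function names queue_event and
flush_queue are made up for this example and are not part of nagios.  It
assumes the event handler calls queue_event once per event and that
something like a cron job calls flush_queue every 15 minutes and mails the
returned text.

```python
import os
import time

QUEUE_FILE = "/tmp/nagios-event-queue.txt"  # hypothetical location


def queue_event(host, service, state, queue_file=QUEUE_FILE, now=None):
    """Append one event to the queue file (called by the event handler)."""
    stamp = int(now if now is not None else time.time())
    with open(queue_file, "a") as fh:
        fh.write(f"{stamp}\t{host}\t{service}\t{state}\n")


def flush_queue(queue_file=QUEUE_FILE):
    """Return one digest for all queued events and empty the queue.

    Returns an empty string when nothing is queued.  Sending the digest
    by mail is left to the caller.
    """
    work_file = queue_file + ".work"
    try:
        # Move the queue aside first so new events go to a fresh file.
        os.replace(queue_file, work_file)
    except FileNotFoundError:
        return ""
    with open(work_file) as fh:
        events = [ln.rstrip("\n").split("\t") for ln in fh if ln.strip()]
    os.remove(work_file)
    if not events:
        return ""
    body = [f"{len(events)} event(s) since last digest:"]
    for stamp, host, service, state in events:
        when = time.strftime("%H:%M:%S", time.localtime(int(stamp)))
        body.append(f"{when} {host}/{service} is {state}")
    return "\n".join(body)
```

With this approach, 100 hosts going down inside one interval would produce
a single message listing 100 events rather than 100 separate emails.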

1st problem: I then add a host downtime entry for this device, unhook it
from the network so the ping service check will fail, and I continue to
get notifications from the event handler for that service.  (So, am I
correct in assuming that setting a host downtime entry for a device does
not stop its service checks from sending notifications?  Why is this?  If
a host is down, then its services will be down too.)

2nd problem: I then add a service downtime entry for this device, unhook
it from the network again so that the ping service check will fail, and I
still continue to get notifications from the event handler for that
service.  (So, am I correct in assuming that downtime for a service only
affects email notifications and not event handler notifications?)

3rd problem: One problem with netsaint, which I still see in nagios, is
the lack of a tool to bulk manage a number of devices.  For example, if I
know a building is going to lose power from 08:00 to 14:00, then I want to
schedule downtime for all devices in that building.  With the current
process of setting downtime, this is rather tedious.  I'll probably get
around it by writing my own interface to do this (as I did with netsaint),
but this would be a great feature to add to your wishlist.
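
A bulk tool along these lines could be as simple as generating one
external command per host.  The Python sketch below is only an
illustration under assumptions: the function names are hypothetical, and
the field order (host; start; end; fixed; trigger id; duration; author;
comment) follows the SCHEDULE_HOST_DOWNTIME layout documented for later
nagios releases, which may not match 1.0b5.  Check the external command
documentation for your version before using anything like this.

```python
import time


def downtime_commands(hosts, start, end, author, comment, now=None):
    """Build one SCHEDULE_HOST_DOWNTIME line per host.

    start and end are Unix timestamps.  The field layout is an
    assumption taken from later nagios releases (see the note above).
    """
    now = int(now if now is not None else time.time())
    start, end = int(start), int(end)
    duration = end - start
    return [
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"1;0;{duration};{author};{comment}"
        for host in hosts
    ]


def submit(lines, command_file):
    """Write the command lines to the external command file.

    command_file is expected to be the named pipe that nagios reads
    external commands from; its path depends on the installation.
    """
    with open(command_file, "w") as fh:
        for line in lines:
            fh.write(line + "\n")
```

For the building example, the host list would be every device in the
building, and start and end would be the timestamps for 08:00 and 14:00,
giving a duration of 21600 seconds.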

Question: assuming downtime scheduling works correctly, would it be
possible to put downtime data into mysql and point 2 different nagios
servers at the database?  The servers would be monitoring the same
devices, so, essentially, they would have the same downtime schedules.

Thanks for listening.. I think nagios is a step in the right direction as
far as network/host monitoring goes.

Brian

--
Brian Wilson  <wilson at ncsu.edu>      Network Analyst
Communication Technologies, ATD      W: 919.513.3472
North Carolina State University      www.ncstate.net


