Lots of hosts, only a couple of services?

Cook, Garry GWCOOK at mactec.com
Tue Aug 24 18:55:31 CEST 2004
nagios-users-admin at lists.sourceforge.net wrote:
> Good morning, fellow Nagios users!
> 
> Right now, we have about 250 hosts.  But there are only 6 of them
> monitoring services like DNS, HTTP, SMTP, and POP3.  All the rest are
> being pinged only, since they are very simple network devices.  Switches
> and wireless access points, mainly.  These devices DO have telnet and
> generally web interfaces, though.

So, you can also check telnet and http. Most likely they have SNMP
variables that can be checked with check_snmp as well.
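
As a rough sketch of the SNMP idea (not taken from a running config), a
command and service that poll sysUpTime (.1.3.6.1.2.1.1.3.0) might look
something like this. The command name check_snmp_uptime, the host name
switch01 and the community string "public" are placeholders made up for
illustration, and it assumes the generic-service template supplies any
directives not listed. The -H, -C and -o options are check_snmp's host,
community and OID arguments, but check --help for your plugin version.

```
# Hypothetical command: query a single OID with check_snmp.
# $ARG1$ is the SNMP read community passed in from the service.
define command{
        command_name    check_snmp_uptime
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o .1.3.6.1.2.1.1.3.0
        }

# Hypothetical service using it; switch01 and "public" are placeholders.
define service{
        use                             generic-service
        host_name                       switch01
        service_description             SNMP uptime
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           15
        retry_check_interval            5
        notification_interval           60
        notification_period             daytime
        notification_options            c,r
        check_command                   check_snmp_uptime!public
        }
```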


> Here's part of the problem: If any device misses a single service check,
> a host check is immediately triggered.  But sometimes a device can miss
> a ping even though there is no problem, just a burst of network traffic.
> Unfortunately, the service checks do not respect the
> max_check_attempts in this regard.  Instead, after any single missed
> service check, a host check is immediately triggered.  AND, if that host
> check also fails -- quite likely if the other ping just failed a few
> seconds ago -- then a notification is sent out immediately.  Again
> ignoring the max_check_attempts value.  I have already confirmed that
> this behavior is by design!

I don't think this is entirely correct. You should understand the
difference between service checks and host checks. 'max_check_attempts'
is set separately for each: the value in a service definition applies to
that service check, and the value in a host definition applies to the
host check.

Here is the ping check that I run on all of my hosts, taken from my
services.cfg file:

# router-ping definition
define service{
        name                            service-ping

        use                             generic-service

        host_name                       *
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           15
        retry_check_interval            5
        notification_interval           60
        notification_period             daytime
        notification_options            c,r
        check_command                   check_ping
        
        register                        1  
        }

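So, if PING goes into a warn or crit state, it has to stay that way for
at least 10 minutes after the first failed check before a notification
can be sent: the first failure counts as attempt 1, and it is checked
two more times at five-minute intervals before changing to a HARD state.
Because the service is only checked every 15 minutes while it is OK, the
problem may already have existed for up to 15 minutes before that first
failed check, so the total is somewhere between 10 and about 25 minutes.
Note that with notification_options set to c,r, only critical and
recovery notifications actually go out.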

However, after the first failed check, the host check is run. Let's say a
PING service check fails on one of my routers. It would trigger the
'check-router-alive' command defined in my host entry from my hosts.cfg
file. Here is the actual check from my checkcommands.cfg file:

define command{
        command_name    check-router-alive
        command_line    $USER1$/check_ping $HOSTADDRESS$ 100 100 5000.0 5000.0 -p 1
        }

This shows that to check whether a router is up or down, we use
check_ping, send one packet, and treat only 100% packet loss or a
5-second RTA as warning or critical. This avoids almost all false
positives due to heavy bursts of traffic. Also, depending upon the
version of the plugin that you are using, your command line may look
different from mine.

I'm not convinced that Nagios ignores the max_check_attempts (or is it
max_attempts?) defined in my host definition, but I can't seem to locate
it in the docs at the moment, so you may be right about this. Either
way, if the host is down you'll want to know right away, so you could
set max_check_attempts to 1 and tweak the plugin arguments to help avoid
false positives.
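
To make that concrete, here is a minimal sketch of a host entry that
ties the 'check-router-alive' command above to a host and sets
max_check_attempts on the host itself. The host_name, alias and address
are placeholders (192.0.2.1 is a documentation address), and the
notification settings are only examples, so adjust them to your own
setup.

```
# Sketch of a host entry; router01, the alias and the address are
# placeholders, not values from a real hosts.cfg.
define host{
        host_name                       router01
        alias                           Example router
        address                         192.0.2.1
        check_command                   check-router-alive
# Number of host check attempts before the host is considered down.
# Set this to 1 if you want to be told right away.
        max_check_attempts              3
        notification_interval           60
        notification_period             daytime
        notification_options            d,r
        }
```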

Please see the following docs for more information:
http://nagios.sourceforge.net/docs/1_0/checkscheduling.html
http://nagios.sourceforge.net/docs/1_0/templatetricks.html


> Right now, I'm using a dummy check for host checks.  That took care of
> the problem where it was immediately triggering a notification, if any
> device missed one service check.  But the problem is that now our host
> status map doesn't show any of the problems, everything there is always
> green.

You don't want to use a dummy host check. Use a real check and tweak it
to fit your needs. If check_ping won't do what you want it to do then
use check_http, or even check_snmp to check one of the oids (CPU, MEM,
etc).


> So here's my question: how can I improve our Nagios setup?
> 
> Here are my goals:
> 1) Prevent false positives with max_check_attempts (set to 5)
> 2) Get Nagios to respect max_check_attempts
> 3) Have the Status Map correctly show situation if any devices are down.

1) Tweak your host check
2) AFAIK, Nagios does respect max_check_attempts. You mentioned above
that you have confirmed that by design it does not. Can you point me in
the direction of the docs that confirmed this?
3) Don't use a dummy host check. Use a real working check.


> Could I...
> 1) Check telnet instead of just pinging these devices?  (And change the
>    host checks back to the regular host_check_alive?)
> 2) Not check services at all, unless necessary, and only do host checks?
>    (Nagios throws lots of warnings if you do this, and I suppose I'd
>    rather avoid that)
> 3) ...?  (Profit?)

1) Sure
2) No, Nagios will not run host checks unless a service check fails.
3) Profit? I'm not sure what you're asking here. Contact me off-list if
you want to help me move some money out of Nigeria. Or visit
http://findopensourcesupport.com.

 
> I haven't written my own plugins yet, so I'm trying to figure out how
> hard it'd be to check telnet.  The devices are different enough that I
> doubt I can count on very similar responses from any telnet attempts...
> 
> Suggestions?  Advice?  Ideas?
> 
> Thanks very much for anything you can offer!

I thought that there was a check_telnet plugin somewhere, but after
poking around I can't seem to locate it. If you plan to write your own
plugin, there are developer guidelines here:
http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN185.
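
If all you need to know is that the telnet port answers, rather than
what each device prints after you connect, one option that avoids
writing a plugin is the check_tcp plugin from the standard plugins
package, pointed at port 23. This is only a sketch: the command name
check_telnet_port is made up for illustration, and you should confirm
the -H and -p options against --help for your plugin version.

```
# Hypothetical command: OK if a TCP connection to port 23 can be opened.
# It does not log in or inspect the banner, so differences between
# device types should not matter.
define command{
        command_name    check_telnet_port
        command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p 23
        }
```

It could then be referenced from a service definition with
check_command set to check_telnet_port.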

Garry W. Cook, CCNA
Network Infrastructure Manager
MACTEC, Inc. - http://www.mactec.com/
303.308.6228 (Office) - 720.220.1862 (Mobile)

