Escalations and latency

andy at droidmcse.com
Wed Nov 6 19:44:51 CET 2002


Hey Gang!

Ethan - Awesome program.  Great work!

I have a dual 1.4 GHz Compaq server w/ 2 GB of memory.  The sole purpose
of this box is to run Nagios.

Here is my observation and I'm hoping that someone can offer up a solution
or at least an explanation of why this is the way it is:

I have about 450 hosts and a little over 1300 service checks occurring.  75%
of those are a standard set of NT checks - cpu load, memory, disk space,
etc.  Outside of those checks, I'm doing specific checks for things like
web servers.  Notice I have contact_groups set to nt-admins-tier1.

define service{
        use                             generic-service         ; Name of service template to use

        host_name                       usmnli05,usmnli06
        service_description             Promo Planning
        max_check_attempts              2
        retry_check_interval            1
        contact_groups                  nt-admins-tier1
        notification_options            w,u,c,r
        check_command                   "$USER1$/check_http -H $HOSTADDRESS$ -u /promoplanning/asp/home.asp"
        }

define serviceescalation{
        host_name               usmnli05,usmnli06
        service_description     Promo Planning
        contact_groups          nt-admins-tier1
        first_notification      1
        last_notification       4
        notification_interval   30
        }

define serviceescalation{
        host_name               usmnli05,usmnli06
        service_description     Promo Planning
        contact_groups          nt-admins-tier2
        first_notification      2
        last_notification       4
        notification_interval   30
        }

define serviceescalation{
        host_name               usmnli05,usmnli06
        service_description     Promo Planning
        contact_groups          nt-admins-tier3
        first_notification      3
        last_notification       3
        notification_interval   30
        }

define serviceescalation{
        host_name               usmnli05,usmnli06
        service_description     Promo Planning
        contact_groups          nt-admins-tier1
        first_notification      4
        last_notification       4
        notification_interval   30
        }

My intention is to notify the NT tier structure 4 times, at 30-minute
intervals.  On the 3rd notification, I generate an email message that logs
a help desk ticket.  This is a cover-my-a** measure: if sendpage has died,
at least I am pushing the job off to our help desk to get the ticket logged
and resolved.
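
To make the intent above concrete, here is a rough Python sketch of which
contact groups each notification would reach.  It assumes a simplified
model in which a notification goes to every escalation entry whose
first_notification..last_notification range contains its number; the
function name groups_for is made up for illustration and is not part of
Nagios.

```python
# Simplified model (an assumption, not Nagios code): a notification is sent
# to every escalation whose [first_notification, last_notification] range
# contains the notification number.
ESCALATIONS = [
    # (contact_groups, first_notification, last_notification)
    ("nt-admins-tier1", 1, 4),
    ("nt-admins-tier2", 2, 4),
    ("nt-admins-tier3", 3, 3),
    ("nt-admins-tier1", 4, 4),
]


def groups_for(notification_number, escalations=ESCALATIONS):
    """Return the sorted, de-duplicated contact groups for one notification."""
    matched = {
        group
        for group, first, last in escalations
        if first <= notification_number <= last
    }
    return sorted(matched)


for n in range(1, 5):
    print(n, groups_for(n))
```

Under that model, the last entry (tier1, 4 to 4) adds nothing new, because
the first entry already covers notifications 1 through 4.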

Now that I think I've painted a semi-clear picture of my intentions, here
is my problem:

[root@mnmslx11 etc]# ../bin/nagios -s ./nagios.cfg
SERVICE SCHEDULING INFORMATION
        -------------------------------
        Total services:             1314
        Total hosts:                452

Rough guidelines for max_concurrent_checks value:
        -------------------------------------------------
        Absolute minimum value:     12
        Recommend value:            36

According to this information, I only need to execute 36 checks
simultaneously to get all of my checks done.

Immediately after I start Nagios and it starts running the checks, my
latency climbs steadily to Min: 120, Max: 355, Avg: 250, and then holds
steady there.

I just checked the service_check_timeout.  I changed it from 60 down to 20
and it has helped dramatically.  I have also changed the
service_reaper_frequency from 10 to 5.  However, my box is still hitting
the max number of concurrent checks, which I have set at 400.

What this is telling me is that the service checks aren't getting dumped
when they finish.  They are taking an average of 0.8 seconds to complete,
but they are not going away to make room for the next check.  If I'm
utilizing my very poor math skills correctly, at 0.8 seconds per check and
400 checks at a time, I should be able to complete 400 checks (approx) in
1 minute.  With the vast majority of my checks being on 5-minute intervals,
there should be plenty of breathing room to complete the 1300 checks in a
5-minute window.
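
The arithmetic above can be checked with a rough back-of-the-envelope
sketch in Python.  It assumes every one of the 1314 services runs on a
5-minute interval (the post only says the vast majority do), that each
check takes the 0.8-second average, and it ignores host checks and
scheduler overhead.  The function names are made up for illustration.

```python
def required_concurrency(total_checks, interval_s, avg_runtime_s):
    """Average number of checks in flight needed to keep up (Little's law)."""
    start_rate = total_checks / interval_s  # checks that must start per second
    return start_rate * avg_runtime_s


def max_throughput(concurrent_slots, avg_runtime_s):
    """Upper bound on checks completed per second if every slot stays busy."""
    return concurrent_slots / avg_runtime_s


# Figures taken from the post: 1314 services, 5-minute interval,
# 0.8 s average runtime, concurrent check limit set at 400.
needed = required_concurrency(1314, 5 * 60, 0.8)
ceiling = max_throughput(400, 0.8)

print(f"average checks in flight needed: {needed:.1f}")
print(f"ceiling with 400 slots: {ceiling:.0f} checks/second")
```

Under these assumptions only about 3.5 checks need to be in flight on
average, and 400 slots could in principle finish roughly 500 checks per
second, so the estimate of 400 checks per minute is, if anything, very
conservative.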

Someone - please chime in and offer up some advice.  Any suggestions would
be greatly appreciated.

Thanks!
Andy




