trying to fix problem with excessive latency

Corey Hickey bugfood-ml at fatooh.org
Wed May 19 03:29:56 CEST 2010


Hello,

I have inherited maintenance of a medium-sized Nagios installation. We 
currently have 649 hosts and 5415 services. Our setup works nicely, with 
one exception: Nagios falls behind on host/service checks. Our usual 
latency once Nagios has been running for a while is about 190-200 
seconds. Our Nagios host is reasonably powerful and isn't struggling; it 
seems that Nagios itself is limited somehow.

I've searched Google and read every relevant document I could find, 
including the tuning page:

http://nagios.sourceforge.net/docs/3_0/tuning.html

So far I haven't been able to find anything wrong with our 
configuration, and my experimental tuning hasn't resulted in any 
improvement. As far as I can tell, Nagios is scheduling the host/service 
checks properly, but not processing the queue aggressively enough.

Some notes:

1. The Nagios host has eight 2 GHz cores and is usually 75-85% idle. Out of 4 
GB of memory, 1.2 GB is free, with no swap usage. We don't seem to be 
running into any physical limitations.

2. Raising max_concurrent_checks doesn't help; 'nagios -s' recommends a 
value of at least 599, so we're using 1200. I've tried absurdly high 
values like 6000, with no improvement.

3. Lowering service_reaper_frequency to 2 doesn't seem to help; in any 
case, our latency of 190 seconds is far higher than the 
service_reaper_frequency.

4. I tried setting max_check_result_reaper_time to 30; no change. I 
don't know what I should set this to.

5. I tried disabling all host check scheduling (setting check_interval 
to 0 in our host template); that may have helped (I'm seeing 173 second 
latency instead of 190) but didn't really solve the problem.

I'm attaching our main nagios.cfg file and including the output of 
nagiostats below.

The host is running 64-bit CentOS 5.4 with a 2.6.18 kernel.

-----------------------------------------------------------------------
Nagios Stats 3.2.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-09-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/log/nagios/status.log
Status File Age:                        0d 0h 0m 6s
Status File Version:                    3.2.1

Program Running Time:                   0d 0h 18m 22s
Nagios PID:                             1556
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         5415
Services Checked:                       5415
Services Scheduled:                     5415
Services Actively Checked:              5415
Services Passively Checked:             0
Total Service State Change:             0.000 / 30.390 / 0.024 %
Active Service Latency:                 5.878 / 197.462 / 194.633 sec
Active Service Execution Time:          0.020 / 120.007 / 0.847 sec
Active Service State Change:            0.000 / 30.390 / 0.024 %
Active Services Last 1/5/15/60 min:     767 / 4236 / 5412 / 5415
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              5358 / 6 / 0 / 51
Services Flapping:                      1
Services In Downtime:                   22

Total Hosts:                            649
Hosts Checked:                          649
Hosts Scheduled:                        649
Hosts Actively Checked:                 649
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.000 / 196.614 / 194.274 sec
Active Host Execution Time:             0.020 / 11.019 / 0.069 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        91 / 506 / 649 / 649
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  646 / 3 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     101 / 536 / 1609
    Scheduled:                           98 / 520 / 1562
    On-demand:                           3 / 16 / 47
    Parallel:                            99 / 522 / 1566
    Serial:                              0 / 0 / 0
    Cached:                              3 / 15 / 44
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  872 / 4360 / 13101
    Scheduled:                           872 / 4360 / 13101
    On-demand:                           0 / 0 / 0
    Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
-----------------------------------------------------------------------
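
As a rough sanity check on the numbers above, here is a minimal Python 
sketch that parses a few of the nagiostats lines and compares the observed 
check rate with the rate needed to get through every service once per 
check interval. The 300-second interval (ASSUMED_CHECK_INTERVAL) is purely 
an assumption for illustration, since the real interval lives in the 
service definitions and isn't shown here; parse_stats is a hypothetical 
helper, not part of Nagios.

```python
import re

# A few lines copied verbatim from the nagiostats output above.
SAMPLE = """\
Total Services:                         5415
Active Service Latency:                 5.878 / 197.462 / 194.633 sec
Active Service Checks Last 1/5/15 min:  872 / 4360 / 13101
"""

# ASSUMPTION: hypothetical check interval in seconds; not stated in this message.
ASSUMED_CHECK_INTERVAL = 300.0


def parse_stats(text):
    """Return {label: [numbers]} for every 'Label: values' line."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Label: value' line
        numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", value)]
        if numbers:
            stats[label.strip()] = numbers
    return stats


stats = parse_stats(SAMPLE)
total_services = stats["Total Services"][0]
checks_last_5_min = stats["Active Service Checks Last 1/5/15 min"][1]

# Observed throughput, averaged over the 5-minute window.
checks_per_second = checks_last_5_min / 300.0
# Time needed to run every service once at the observed rate.
full_cycle_seconds = total_services / checks_per_second
# Rate needed to check every service once per assumed interval.
required_per_second = total_services / ASSUMED_CHECK_INTERVAL

print(f"observed rate:  {checks_per_second:.2f} checks/s")
print(f"required rate:  {required_per_second:.2f} checks/s (assumed interval)")
print(f"full cycle at observed rate: {full_cycle_seconds:.0f} s")
```

If the observed rate comes out below the required rate for the real 
interval, that would be consistent with Nagios not working through the 
queue fast enough, though it wouldn't by itself explain why.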

I have a feeling I'm missing something. I would appreciate any 
suggestions.

Thanks,
Corey
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagios.cfg
URL: <https://www.monitoring-lists.org/archive/users/attachments/20100518/e75cc59b/attachment.ksh>