Help optimizing nagios for when things go wrong

George Bryan georgiebryan at gmail.com
Fri Jun 15 17:37:15 CEST 2007


Hi all,

I'm having a big problem with nagios service check latencies.

When the network is operating normally, with few hosts or services in
critical states, everything runs smoothly. Today, however, we had some
problems and a large number of NRPE checks were timing out after
10 seconds.

In this situation the service check latency skyrocketed to around 12
minutes! We have a grapher system that parses performance data into RRD
databases, and because of this latency the RRD databases weren't being
updated. This went on for a few hours, not just while notifications
were being sent, until I removed the services from the config files.

Does anyone have any tips on how I can prevent this from happening?

Below you can find the output of nagios -s and nagiostats. I don't
have nagiostats output from when we had lots of critical services, but I
can try to reproduce the conditions if that would be useful. I'm very
interested in keeping the latencies to a minimum, even when things go
haywire!

Thanks all.

=========== /usr/nagios/bin/nagios -s /etc/nagios/nagios.cfg ===========
Nagios 2.8
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 04-10-2007
License: GPL

Projected scheduling information for host and service
checks is listed below.  This information assumes that
you are going to start running Nagios with your current
config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     433
Total scheduled hosts:           2
Host inter-check delay method:   SMART
Average host check interval:     86400.00 sec
Host inter-check delay:          900.00 sec
Max host check spread:           30 min
First scheduled check:           Fri Jun 15 16:23:29 2007
Last scheduled check:            Fri Jun 15 16:38:29 2007


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     2126
Total scheduled services:           2126
Service inter-check delay method:   SMART
Average service check interval:     300.00 sec
Inter-check delay:                  0.14 sec
Interleave factor method:           SMART
Average services per host:          4.91
Service interleave factor:          5
Max service check spread:           30 min
First scheduled check:              Fri Jun 15 16:24:29 2007
Last scheduled check:               Fri Jun 15 16:29:29 2007


CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval:      10 sec
Max concurrent service checks:      Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
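
As a cross-check on the scheduling figures above, here is a minimal Python sketch of the arithmetic. It assumes the SMART inter-check delay is the average check interval divided by the number of scheduled services, capped by the max check spread; that is my understanding of the method, not something the output itself states. The function names are made up for illustration. The second function applies Little's law to estimate how many checks are running at once in the hypothetical worst case where every check runs until the 10 second timeout mentioned earlier.

```python
def smart_inter_check_delay(avg_interval_s, total_services, max_spread_min):
    """Delay between consecutive service checks (assumed SMART method)."""
    delay = avg_interval_s / total_services
    # Assumed cap: all checks must fit inside the max check spread window.
    max_delay = (max_spread_min * 60.0) / total_services
    return min(delay, max_delay)


def average_checks_in_flight(total_services, avg_interval_s, exec_time_s):
    """Little's law: concurrent checks = check rate * time each check takes."""
    return (total_services / avg_interval_s) * exec_time_s


# Figures taken from the nagios -s output above.
delay = smart_inter_check_delay(300.0, 2126, 30)
rate = 2126 / 300.0

print(f"inter-check delay: {delay:.2f} sec")
print(f"check rate: {rate:.2f} checks/sec")
# Hypothetical worst case: every check runs until the 10 second timeout.
print(f"checks in flight at 10 s each: "
      f"{average_checks_in_flight(2126, 300.0, 10.0):.1f}")
```

The computed delay rounds to the 0.14 sec reported above. The in-flight number is only a rough load estimate under the stated assumption, not a diagnosis of the latency.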


=========== /usr/nagios/bin/nagiostats ===========
Nagios Stats 2.8
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 04-10-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File:                          /var/nagios/status.log
Status File Age:                      0d 0h 0m 25s
Status File Version:                  2.8

Program Running Time:                 0d 0h 16m 31s
Nagios PID:                           26886
Used/High/Total Command Buffers:      0 / 0 / 4096
Used/High/Total Check Result Buffers: 134 / 134 / 4096

Total Services:                       2126
Services Checked:                     2126
Services Scheduled:                   2126
Active Service Checks:                2126
Passive Service Checks:               0
Total Service State Change:           0.000 / 28.950 / 0.096 %
Active Service Latency:               18.980 / 120.190 / 66.213 sec
Active Service Execution Time:        0.079 / 60.078 / 1.633 sec
Active Service State Change:          0.000 / 28.950 / 0.096 %
Active Services Last 1/5/15/60 min:   141 / 1621 / 2126 / 2126
Passive Service State Change:         0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:            2101 / 2 / 0 / 23
Services Flapping:                    0
Services In Downtime:                 0

Total Hosts:                          433
Hosts Checked:                        433
Hosts Scheduled:                      2
Active Host Checks:                   433
Passive Host Checks:                  0
Total Host State Change:              0.000 / 32.110 / 0.528 %
Active Host Latency:                  0.000 / 316.979 / 1.378 sec
Active Host Execution Time:           0.070 / 2.638 / 2.585 sec
Active Host State Change:             0.000 / 32.110 / 0.528 %
Active Hosts Last 1/5/15/60 min:      0 / 15 / 31 / 120
Passive Host State Change:            0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                428 / 5 / 0
Hosts Flapping:                       0
Hosts In Downtime:                    0
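
To capture the latency figures the next time conditions degrade, the min / max / avg lines in the nagiostats output above could be parsed with something like the following Python sketch. It assumes the text layout shown in this post (Nagios 2.8); other versions may print the lines differently. The names `parse_triples` and `LATENCY_LIMIT_SEC`, and the 60 second threshold, are hypothetical and chosen only for illustration.

```python
import re

# Matches lines such as:
#   Active Service Latency:               18.980 / 120.190 / 66.213 sec
TRIPLE = re.compile(
    r"^(?P<name>[^:]+):\s+"
    r"(?P<min>[\d.]+) / (?P<max>[\d.]+) / (?P<avg>[\d.]+) (?P<unit>sec|%)\s*$"
)


def parse_triples(text):
    """Return {stat name: (min, max, avg, unit)} for min/max/avg lines."""
    stats = {}
    for line in text.splitlines():
        m = TRIPLE.match(line)
        if m:
            stats[m.group("name").strip()] = (
                float(m.group("min")),
                float(m.group("max")),
                float(m.group("avg")),
                m.group("unit"),
            )
    return stats


# Lines copied from the nagiostats output in this post.
sample = """\
Active Service Latency:               18.980 / 120.190 / 66.213 sec
Active Service Execution Time:        0.079 / 60.078 / 1.633 sec
Active Services Last 1/5/15/60 min:   141 / 1621 / 2126 / 2126
"""

LATENCY_LIMIT_SEC = 60.0  # hypothetical alert threshold

stats = parse_triples(sample)
avg_latency = stats["Active Service Latency"][2]
if avg_latency > LATENCY_LIMIT_SEC:
    print(f"average service latency {avg_latency} sec exceeds limit")
```

Lines that do not end in a min / max / avg triple with a unit, such as the "Last 1/5/15/60 min" counters, are skipped on purpose.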
