High check latency in a machine with low load

Daniel Wittenberg daniel.wittenberg.r0ko at statefarm.com
Tue Oct 11 15:44:37 CEST 2011


I think you have the enable_high_latency option enabled :)  j/k

Do you have any particular checks that are taking a long time?  For example, can you watch top and see whether any checks run for a while?

Dan


From: Javier Vela Diago [mailto:jvela at s2grupo.es]
Sent: Tuesday, October 11, 2011 6:23 AM
To: nagios-users at lists.sourceforge.net
Subject: [Nagios-users] High check latency in a machine with low load

Hi,

I have a Nagios 3.2.3 deployment with 1000+ hosts and 3000+ services. Nagios runs together with NDO and PNP (in bulk mode) on a server with 4 GB of RAM and 4 CPUs.

One day I noticed that the check latency shown in the performance CGI was very high (300-400 seconds). This seemed strange, so I took the Nagios tuning guide (http://nagios.sourceforge.net/docs/3_0/tuning.html) and applied every point I could. In particular, I set max_concurrent_checks to zero (no limit):

max_concurrent_checks=0

I adjusted the check result reaper settings:

service_reaper_frequency=5
max_check_result_reaper_time=15

I also checked that the host checks were not forced. In addition, I configured a 15-second host check cache:

cached_host_check_horizon=15

But the problem remains, and the load on the server is not very high: a load average of 2.5, 2 GB of free memory, and an average disk utilization of 7%. I disabled NDO and PNP, but it made no difference. After the first round of checks the latency returns, while the server load does not grow.

I have searched Google, but all the problems I found were caused by load on the server, which is not the main problem here. So my question is: what can I do now? Is there some variable that shows me where to look? I'm a bit lost right now and I don't know how to find the problem.

Or maybe the only way is to configure a master-slave Nagios setup in order to maximize server utilization?

In addition, I have fairly long timeouts (60 seconds) because of high latency on the network. Any help is appreciated. Thank you in advance.

nagiostats
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/argos/aplicaciones/nagios/var/status.dat
Status File Age:                        0d 0h 0m 11s
Status File Version:                    3.2.3

Program Running Time:                   0d 20h 56m 7s
Nagios PID:                             21834
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         4032
Services Checked:                       4032
Services Scheduled:                     4030
Services Actively Checked:              4032
Services Passively Checked:             0
Total Service State Change:             0.000 / 37.300 / 0.163 %
Active Service Latency:                 32.876 / 442.138 / 415.816 sec
Active Service Execution Time:          0.051 / 60.097 / 1.545 sec
Active Service State Change:            0.000 / 37.300 / 0.163 %
Active Services Last 1/5/15/60 min:     237 / 1530 / 4020 / 4020
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              3766 / 38 / 44 / 184
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            931
Hosts Checked:                          931
Hosts Scheduled:                        931
Hosts Actively Checked:                 931
Host Passively Checked:                 0
Total Host State Change:                0.000 / 12.370 / 0.077 %
Active Host Latency:                    0.000 / 441.308 / 416.063 sec
Active Host Execution Time:             0.062 / 10.113 / 0.395 sec
Active Host State Change:               0.000 / 12.370 / 0.077 %
Active Hosts Last 1/5/15/60 min:        74 / 423 / 931 / 931
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  897 / 24 / 10
Hosts Flapping:                         0
Hosts In Downtime:                      1

Active Host Checks Last 1/5/15 min:     109 / 535 / 1583
   Scheduled:                           87 / 433 / 1300
   On-demand:                           22 / 102 / 283
   Parallel:                            87 / 438 / 1323
   Serial:                              0 / 0 / 0
   Cached:                              22 / 97 / 260
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  304 / 1605 / 4924
   Scheduled:                           304 / 1605 / 4923
   On-demand:                           0 / 0 / 1
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
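
To compare latency against execution time without reading the columns by eye, the min / max / avg lines above can be pulled apart with a small script. This is only a sketch: the regular expression and the function name `parse_triples` are illustrative assumptions, not part of Nagios, and the sample text is copied from the nagiostats output in this message.

```python
import re

# Matches nagiostats lines of the form "Label:   min / max / avg sec" (or "%").
# The pattern is an assumption based on the output pasted in this post.
TRIPLE = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<min>[\d.]+) / (?P<max>[\d.]+) / (?P<avg>[\d.]+) (?P<unit>sec|%)\s*$"
)


def parse_triples(text):
    """Return {label: (min, max, avg)} for every min/max/avg line in text."""
    stats = {}
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            stats[m["label"]] = (float(m["min"]), float(m["max"]), float(m["avg"]))
    return stats


# Four lines copied verbatim from the nagiostats output above.
SAMPLE = """\
Active Service Latency:                 32.876 / 442.138 / 415.816 sec
Active Service Execution Time:          0.051 / 60.097 / 1.545 sec
Active Host Latency:                    0.000 / 441.308 / 416.063 sec
Active Host Execution Time:             0.062 / 10.113 / 0.395 sec
"""

stats = parse_triples(SAMPLE)
for kind in ("Service", "Host"):
    lat_avg = stats[f"Active {kind} Latency"][2]
    exe_avg = stats[f"Active {kind} Execution Time"][2]
    # A large ratio means checks wait far longer to start than they take to run.
    print(f"{kind}: avg latency {lat_avg:.1f}s, avg execution {exe_avg:.3f}s, "
          f"ratio {lat_avg / exe_avg:.0f}x")
```

With these figures the average latency is hundreds of times the average execution time, which suggests the checks themselves are quick and the waiting happens before they are started.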

nagios -s

Nagios Core 3.2.3
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 10-03-2010
License: GPL

Website: http://www.nagios.org
Warning: aggregate_status_updates directive ignored.  All status file updates are now aggregated.
Warning: downtime_file variable ignored.  Downtime entries are now stored in the status and retention files.
Warning: comment_file variable ignored.  Comments are now stored in the status and retention files.
Timing information on object configuration processing is listed
below.  You can use this information to see if precaching your
object configuration would be useful.

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.080036 sec
Resolve:              0.010660 sec  *
Recomb Contactgroups: 0.002666 sec  *
Recomb Hostgroups:    0.004086 sec  *
Dup Services:         0.034632 sec  *
Recomb Servicegroups: 0.001277 sec  *
Duplicate:            0.010939 sec  *
Inherit:              0.005594 sec  *
Recomb Contacts:      0.000001 sec  *
Sort:                 0.000000 sec  *
Register:             0.074413 sec
Free:                 0.008730 sec
                      ============
TOTAL:                0.234920 sec  * = 0.071741 sec (30.54%) estimated savings


RETENTION DATA TIMES
----------------------------------
Read and Process:     0.495480 sec
                      ============
TOTAL:                0.495480 sec


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.060039 sec
Circular Paths:       0.026557 sec  *
Misc:                 0.005999 sec
                      ============
TOTAL:                0.092595 sec  * = 0.026557 sec (28.7%) estimated savings


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.014509 sec
Get host info info:      0.002853 sec
Get service params:      0.000078 sec
Schedule service times:  0.039947 sec
Schedule service events: 0.034656 sec
Get host params:         0.000001 sec
Schedule host times:     0.007519 sec
Schedule host events:    0.029519 sec
                         ============
TOTAL:                   0.129082 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     931
Total scheduled hosts:           931
Host inter-check delay method:   SMART
Average host check interval:     259.01 sec
Host inter-check delay:          0.28 sec
Max host check spread:           30 min
First scheduled check:           Tue Oct 11 13:14:08 2011
Last scheduled check:            Tue Oct 11 13:18:26 2011


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     4032
Total scheduled services:           4030
Service inter-check delay method:   SMART
Average service check interval:     299.55 sec
Inter-check delay:                  0.07 sec
Interleave factor method:           SMART
Average services per host:          4.33
Service interleave factor:          5
Max service check spread:           30 min
First scheduled check:              Tue Oct 11 13:15:07 2011
Last scheduled check:               Tue Oct 11 13:20:07 2011


CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       5 sec
Max concurrent service checks:      Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
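
The figures above allow a rough throughput check: how many service checks per second the schedule requires versus how many were actually run. The following is a back-of-the-envelope sketch, not a diagnosis. All numbers are copied from the output in this message, the variable names are made up for illustration, and the in-flight estimate assumes Little's law applies to the average execution time.

```python
# Figures copied from the nagiostats and `nagios -s` output in this post.
scheduled_services = 4030       # "Services Scheduled"
avg_check_interval_s = 299.55   # "Average service check interval"
checks_last_5_min = 1605        # "Active Service Checks Last 1/5/15 min", 5-minute column
avg_execution_time_s = 1.545    # "Active Service Execution Time", average

# Rate needed to check every scheduled service once per interval.
required_per_s = scheduled_services / avg_check_interval_s

# Rate actually achieved over the last five minutes.
observed_per_s = checks_last_5_min / 300.0

# Little's law: average checks in flight = start rate * average time per check.
needed_in_flight = required_per_s * avg_execution_time_s

print(f"required: {required_per_s:.2f} checks/s")
print(f"observed: {observed_per_s:.2f} checks/s "
      f"({observed_per_s / required_per_s:.0%} of required)")
print(f"checks in flight needed to keep up: about {needed_in_flight:.0f}")
```

If the observed rate stays well below the required rate, the scheduling queue keeps falling behind and latency grows regardless of how idle the machine looks. The arithmetic does not say what limits the rate, only that checks are being started more slowly than the schedule demands.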
--
Javier Vela Diago
S2 GRUPO
Ramiro de Maeztu, 7 bajo. 46022 Valencia
Tel: 963.110.300 Fax: 963.106.086
e-mail : jvela at s2grupo dot es
http://www.s2grupo.es